Abstract


Computer system designers, administrators, and users are interested in performance
evaluation since their goal is to obtain or provide the maximum performance
at the lowest cost. System performance varies enormously from one application domain
to another; therefore, no single metric can measure the performance of computer
systems for all applications. The load on the various components of the system also
affects performance considerably. Several performance evaluation and benchmarking
tools exist which measure the performance of a computer system for a particular
workload. However, very few benchmarks, if any, check the performance of a system
under varying system loads. Most benchmarks are designed to run on an 'idle
system'. Therefore, they present a measure of the 'peak performance' of a system for a
particular type of workload. In real life, these 'peak performance' figures are of little
help.

This tool measures the load on different components of a system and evaluates
the system performance by running different synthesized test programs. These programs
have been selected to approximate the workload of the IIT Kanpur computing
environment. The results of performance evaluation have been presented along with
the different components of current system load.

Chapter 1
Introduction 


In the early days of computing, a programmer's main goal was to get a working
program with little thought about its efficiency. In 1946 von Neumann compared the
speed with which the early computers (including ENIAC) performed multiplication
when computing ballistic trajectories[McK88]. Herbst et al., in 1955, measured the
instruction mix of programs running on the Maniac computer.

Computer system designers, administrators, and users are all interested in performance
evaluation. Designers compare a number of alternative designs to find the
best design. Administrators compare a number of alternative systems to decide the
best system for a set of applications. Users compare a number of installed systems
to find the best system for a particular job. To get maximum performance at the lowest
cost is the key criterion in the design, procurement, and use of computer systems.

The performance evaluation of a computer system is an ambiguous task. Various
people may think of entirely different things when they use the term "performance".
The people concerned with the use of large databases tend to think of performance
as the number of transactions performed per second, while people from the scientific
and technological areas may be interested in the number of floating point operations
per second. Even with a narrow focus the assessment of performance is not
straightforward. Suppose one is interested in only scientific computing. A large variety
of high performance computers are available for scientific computing, varying from
vector computers with a limited number of processors sharing common memory, to
machines with thousands of processors and distributed memories. The performance
range of these machines can vary, sometimes by a factor of a thousand or more,
depending on how the problem and the program suit the underlying architecture and
operating system.

1.1 Goals of Performance Evaluation 

Goals are an important part of all endeavors. Any endeavor without goals is bound
to fail. Performance evaluation projects are no exception. [Jai91] presents a list
of common mistakes that can be found in identifying the goals of a performance
evaluation project. The design of a performance evaluation tool or benchmark needs
to identify what exactly one is trying to measure. The common targets for measuring
performance are CPU and disk I/O. With the current trend towards disk-less or
data-less(1) workstations, the performance of the file server and the network has become
crucial, and so has its measurement.

Performance measurement tools are normally used to predict the performance of 
an unknown system on a known, or at least well defined task or set of tasks. The 
performance results are used to make purchase decisions for a new system. Such tools 
can also be used as monitoring and diagnostic tools. One can potentially pinpoint
the cause of poor performance by running a test program and comparing the results 
against a known configuration. Similarly, one can run a test program after making 
a change to determine the improvement or degradation in performance. 

The best program to test a system's performance is the actual application that
will run on the system. But that is not always possible because the applications are
not always ready before the system is purchased. Even if the application is available
before purchasing the system, the same computer system can perform differently on
different runs of the same application, as the input data and other parameters may
vary. Another problem is that very few systems are dedicated to a particular task;
most installations are used for running various applications. If the nature of
the jobs that the different applications perform is the same, then test programs
can be designed to predict the system performance for those types of jobs. But in
the hybrid environments present in educational institutions, the same resources are
used for various applications (unless the application is very specific). Therefore, no
single number can represent system performance; rather, the performance of the different
subsystems (e.g. CPU, disk I/O, etc.) is measured and the result is presented as a set
of numbers.

(1) Data-less workstations are workstations with a small-capacity local disk. This disk
contains the OS files and binaries but not the user files.


1.2 Workload Selection and Benchmarking 

Every measure of performance requires some specification of the workload that is to
be handled. Practical considerations require that only a few of the many possible jobs
be selected as representative of the work expected from the system. Some criteria
are:


• Jobs that run most frequently

• Jobs that account for most of the system's time

• Jobs with critical response (completion) time requirements

Understanding the workload is the foundation for performance studies. Once the 
workload is understood, evaluators can design a sequence of tests to investigate 
specific components of performance, culminating in tests that correlate directly with 
a given workload. Workload parameters include job CPU time, job I/O requests, 
I/O service time, job priority, job memory usage and job sleeping time. Values for 
these parameters are selected, such that they can closely approximate the actual 
system load. 
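
The workload parameters listed above can be collected into one record per job. The
following is only a minimal sketch of such a record, not the representation used by
the tool: the class name `JobProfile`, its field names and units, and the two helper
methods are hypothetical, and the sample values are illustrative rather than measured.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Hypothetical record of the workload parameters named in the text."""
    cpu_time_s: float         # job CPU time
    io_requests: int          # number of job I/O requests
    io_service_time_s: float  # service time per I/O request
    priority: int             # job priority
    memory_kb: int            # job memory usage
    sleep_time_s: float       # time the job spends sleeping

    def total_time_s(self) -> float:
        """Elapsed time if the CPU, I/O and sleep phases do not overlap."""
        return (self.cpu_time_s
                + self.io_requests * self.io_service_time_s
                + self.sleep_time_s)

    def cpu_fraction(self) -> float:
        """Share of the elapsed time spent on the CPU (1.0 means CPU bound)."""
        total = self.total_time_s()
        return self.cpu_time_s / total if total > 0 else 0.0


# Illustrative values only; they are not measurements from any system.
edit_job = JobProfile(cpu_time_s=0.5, io_requests=200, io_service_time_s=0.01,
                      priority=0, memory_kb=512, sleep_time_s=7.5)
print(f"CPU fraction: {edit_job.cpu_fraction():.2f}")
```

A record of this kind makes it easy to check whether a synthetic job leans towards
the CPU-bound or the I/O-bound end of the workload before it is used in a test.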

A benchmark is a documented procedure that measures the time needed by a
computer system to execute a well-defined computing task. It is assumed that this time
is related to the performance of the computer system and that the same procedure
can be applied to other systems so that comparisons can be made between
different hardware/software configurations. [NB7.5] presents a detailed discussion
on benchmarking. [Don87] discusses some pitfalls in computer benchmarking. Before
starting benchmark design (or selection), a basic choice must be made between
synthetic benchmarks and application benchmarks.

1.2.1 Synthetic Vs Application Benchmarks 

Synthetic benchmarks[Gaz] are specially designed to measure the performance of
individual components of a computer system, usually by exercising the chosen component
to its maximum capacity. For example, Whetstone is a synthetic benchmark
used to measure the floating point performance of a CPU. A synthetic benchmark should
measure a specific aspect of the system being tested, independent of all other aspects.
For example, a synthetic benchmark for Ethernet card I/O throughput should result in the
same or similar figures whether it is run on a 386SX-16 with 4 MBytes of RAM or
a Pentium 200 MMX with 64 MBytes of RAM.

Application benchmarks try to measure the performance of computer systems for
some category of real-world computing tasks. A commonly executed application is
chosen and the time to execute this application is used as a benchmark.

1.2.2 Low-level Vs High-Level Benchmarks 

Low-level benchmarks[HOW] directly measure the performance of the hardware:
CPU clock, DRAM and cache SRAM cycle times, hard disk average access time,
latency and track-to-track stepping time, etc. Such benchmarks are used to verify
the figures given in the data sheet of a component. Another use of low-level benchmarks
is to check whether a kernel driver is correctly configured for a specific piece
of hardware.

High-level benchmarks are more concerned with the performance of the
hardware-driver-OS-compiler combination for a specific aspect of the computer system,
for example file I/O performance, or even with the performance of a specific
hardware-driver-OS-compiler-application combination, e.g. benchmarking a specific
Web server package on different microcomputer systems, or different Web server
packages on the same system.

1.3 Motivation


Several benchmarks exist for measuring the performance of different aspects of com- 
puter systems. All these benchmarks produce one or more numbers and these num- 
bers are presented as the measure of system performance. A well known computer 
system is considered as the reference machine and the results of running these bench- 
marks on the reference machine are published so that one can compare the perform- 
ance of a machine with that of the reference machine. 

IIT Kanpur has a number of computer systems which enjoy a high rating on the
popular benchmarks. But the performance these systems deliver in the real environment
is not up to expectations. The high load on these systems is considered the main
reason for the low performance. Saying that the systems are overloaded is not
sufficient, however, since the overall system consists of several components and it
is hard to find the bottleneck.

The IITK environment consists of several servers connected by a TCP/IP network.
These servers are accessed through dumb terminals, X-terminals, and disk-less
or disk-full workstations. Dumb terminals are attached through terminal servers.
X-terminals and workstations run the X-Windows system. The window manager and other
X-applications run either locally or on one of the servers. User files are served on
different machines by the Network File Service running on the servers. The workload
varies from the 'edit-compile-execute-debug' cycle to highly CPU bound jobs
such as programs solving differential equations, or computing Fourier transforms, or
performing multiplication and other operations on large matrices. Some users run a
mix of CPU and I/O bound jobs such as programs simulating an assembly line of a
production unit.

Running a benchmark designed for a specific workload in such an environment is
not useful. The need is for a way to relate the current system load (the load on the
different subsystems) to its performance. Unfortunately, very little work has been done
on this. Our primary motivation is to provide a tool which can measure the performance
of different subsystems (e.g. CPU, disk I/O, NFS, network) along with the current
load on those subsystems. The result of running a benchmark for measuring the
performance of a subsystem should be listed with the current system loads which
can affect the performance of that subsystem. The program should run periodically
on a working system and record system performance under real-life load. For new
systems the tool should generate an artificial random load (that closely approximates
the real-life load) and measure the system performance.
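
The idea of listing a test result together with the load seen at the time can be
illustrated with a short sketch. This is not the tool described in this thesis: it
samples only the run-queue load average (via `os.getloadavg`, available on Unix-like
systems), whereas the tool is meant to record the load on several subsystems. The
function names `current_load`, `cpu_test` and `run_with_load` are hypothetical.

```python
import os
import time


def current_load():
    """Return the 1-, 5- and 15-minute load averages, or None if unavailable."""
    try:
        return os.getloadavg()
    except (AttributeError, OSError):
        return None


def cpu_test(iterations=200_000):
    """A stand-in CPU-bound test: the sum of integer squares."""
    total = 0
    for i in range(iterations):
        total += i * i
    return total


def run_with_load(test, *args):
    """Run one test and report its elapsed time with the load sampled before it."""
    load = current_load()
    start = time.perf_counter()
    test(*args)
    elapsed = time.perf_counter() - start
    return {"load": load, "elapsed_s": elapsed}


if __name__ == "__main__":
    print(run_with_load(cpu_test))
```

Running such a function periodically, for example from a scheduler, would yield a
series of (load, elapsed time) pairs from which the effect of load can be read off.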


1.4 Organization of the Thesis 

First we discuss some popular benchmarks and the workload they assume. Then we
discuss the scope of this performance evaluation tool. After this we present the design
of our performance evaluation tool. This includes the techniques used for collecting
system parameters and system load, and the workload the different tests generate. In
the results, we present the outcome of running this tool on different machines in the
IITK Computer Center and the Computer Science and Engineering Department. In
the conclusions, we discuss further enhancements that can be made to the scope of the
performance evaluation tool.

Chapter 2

An Overview of Benchmarks 


2.1 Introduction 

One of the oldest measures of CPU speed is MHz, i.e., the processor clock frequency
in megahertz. This measure is not useful with today's wide range of CPUs, as
the word length and architecture of the CPU matter a lot in determining the CPU
speed. Another popular, though superficial, measure of CPU speed is MIPS (Millions
of Instructions Per Second). On a family of systems with the same processor, the
MIPS rating can help judge relative system integer performance. The ultimate
apples-to-oranges phenomenon occurs when we use MIPS ratings to compare two different
architectures like RISC and CISC (Reduced and Complex Instruction Set Computer).
To make sense of a MIPS rating, users must define the term instruction. A direct
correlation does not exist between the number of instructions being executed and
the amount of actual work being accomplished. In the world of supercomputers,
the traditional unit is MFLOPS (Millions of Floating Point Operations Per Second).
This often means 'peak MFLOPS', the highest rate of floating point operations per
second, obtainable only in ideal cases.

Any attempt to give MIPS numbers some useful meaning boils down to running a 
representative program or set of programs. Therefore, it is better to drop the notion 
of MIPS and just measure the speed of these benchmark programs. 
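
A small calculation shows why a MIPS rating says little about the work done. The
sketch below uses hypothetical figures: the same task is assumed to need 150 million
instructions on one architecture and 60 million on another, and both are assumed to
finish in one second. The function name `mips` and the numbers are illustrative only.

```python
def mips(instruction_count, seconds):
    """Millions of instructions executed per second."""
    return instruction_count / (seconds * 1e6)


# Hypothetical figures for one and the same task on two architectures.
risc = {"instructions": 150e6, "seconds": 1.0}
cisc = {"instructions": 60e6, "seconds": 1.0}

risc_mips = mips(risc["instructions"], risc["seconds"])
cisc_mips = mips(cisc["instructions"], cisc["seconds"])

# Same task, same elapsed time, yet the two ratings differ.
print(risc_mips, cisc_mips)
```

Under these assumptions both machines deliver the same useful performance, although
one reports a MIPS rating two and a half times that of the other.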

This chapter presents an overview of commonly available benchmarks. These
benchmarks are divided into five major categories: CPU performance benchmarks,
disk I/O and NFS performance benchmarks, network performance benchmarks,
overall performance benchmarks, and miscellaneous benchmarks.


2.2 CPU Performance Benchmarks 

2.2.1 General purpose CPU benchmarks

Whetstone

The Whetstone benchmark[Wei91, Pri89] was the first program in the literature that
was explicitly designed for benchmarking purposes. Its authors are H.J. Curnow
and B.A. Wichmann from the National Physical Laboratory in Great Britain. It was
published in 1976, with ALGOL 60 as the publication language. Today it is almost
exclusively used in its FORTRAN version, with either single precision or double
precision for floating-point numbers.

The Whetstone benchmark owes its name to the Whetstone ALGOL compiler system.
This system was used to collect statistics about the distribution of 'Whetstone
instructions', the instructions of the intermediate language used by this compiler, for a
large number of numerical programs. A synthetic program was then designed, consisting
of several modules where each module contains statements of some particular
type (e.g. integer arithmetic, floating-point arithmetic, if statements, calls), ending
with a statement printing the results. Weights are attached to the different modules
(realized as loop bounds for loops around the individual module's statements) such
that the distribution of Whetstone instructions for the synthetic benchmark matches
the distribution observed in the program sample. The weights have been chosen
in such a way that the program executes a multiple of one million of these Whetstone
instructions; benchmark results are given accordingly as KWIPS (Kilo Whetstone
Instructions Per Second) or MWIPS (Mega Whetstone Instructions Per Second).
Whetstone has a high percentage of floating-point operations. A large fraction of the
execution time is spent in mathematical library functions.
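
The structure described above, modules wrapped in loops whose bounds act as weights,
can be sketched as follows. This is not the Whetstone program: the three module
bodies, the weights 120, 140 and 80, and the function names are hypothetical, and
the reported rate counts module executions rather than Whetstone instructions.

```python
import math
import time


# Each module stands for one statement type; the bodies are hypothetical.
def float_module(x):
    return (x + 1.0) * 0.5


def int_module(x):
    return (int(x) * 3 + 1) % 7


def math_module(x):
    return math.sqrt(abs(math.sin(x)) + 1.0)


# (module, weight): the weight is the loop bound around the module.
MODULES = [(float_module, 120), (int_module, 140), (math_module, 80)]


def run_once():
    """Execute every module `weight` times; return the number of executions."""
    executed = 0
    x = 1.0
    for module, weight in MODULES:
        for _ in range(weight):
            x = float(module(x))
            executed += 1
    return executed


def rate_per_second(passes=100):
    """Module executions per second, analogous in spirit to a WIPS figure."""
    start = time.perf_counter()
    executed = sum(run_once() for _ in range(passes))
    elapsed = time.perf_counter() - start
    return executed / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    print(f"{rate_per_second() / 1000.0:.1f} thousand module executions per second")
```

Changing a weight changes how strongly the corresponding statement type contributes
to the run, which is how the synthetic mix is matched to an observed distribution.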

Dhrystone

As the name already indicates, Dhrystone[Wei91, Pri89, Ser] was developed in a similar
way to Whetstone. It is a synthetic benchmark published by Reinhold Weicker of
Siemens Nixdorf in 1984. The original language of publication is Ada, although
it uses only the 'Pascal subset' of Ada and was intended to be easily translatable to
Pascal and C; presently it is mainly used in the C version.

Dhrystone is based on a literature survey on the distribution of source language
features in non-numeric, system-type programming (operating systems, compilers,
editors, etc.). It has been observed that in addition to the obvious difference in data
types (integral types vs. floating point types), numeric and system-type programs
have other differences also: system programs contain fewer loops, simpler computational
statements, and more 'if' statements and procedure calls.

Dhrystone consists of 12 procedures; they are included in one measurement loop
with 94 statements. During one loop ('one Dhrystone'), 101 statements are executed
dynamically. The results are usually given in 'Dhrystones per second'. Dhrystone
contains no floating-point operations in its measurement loop. A considerable percentage
of execution time is spent in string functions; in extreme cases this number
goes up to 40%. Dhrystone is very sensitive to optimization, and results will appear
erratic unless the degree of optimization is carefully tracked.

Digital Review

Digital Review magazine[Pri89] has compiled a set of benchmark routines that mixes
34 individual integer and floating-point routines. The test itself stresses floating-point
performance. This large benchmark contains over 3,000 lines of FORTRAN
code. The Digital Review benchmark does not perform any verification of test results;
the results usually appear as the geometric mean of all tests performed
(in seconds). At a secondary level the test normalizes relative comparisons among
various systems to the Digital MicroVAX II, which is taken as 1.0. These units are
called MicroVAX units of processing (MVUPs).

Users have criticized this benchmark for its odd structure and unusual instruction
mix that does not accurately represent real-world program flow. Initializing the
routines within the timing loops, rather than running the actual benchmark code,
consumes a large amount of time. Digital Review magazine has taken steps to revise
its benchmark and the new version is now called CPU2. CPU2 is a floating-point
intensive series of FORTRAN programs and consists of thirty-four separate tests.
The benchmark is most relevant in predicting the performance of engineering and
scientific applications.

2.2.2 Integer Only benchmarks 

SIM

The SIM[FAQ, Ser] program finds k best non-intersecting alignments between two
sequences or within one sequence. The program is based on an algorithm presented
by Xiaoqiu Huang and Webb Miller of the Pennsylvania State University.

Using dynamic programming techniques, SIM is guaranteed to find optimal alignments.
The alignments are reported in order of similarity score, with the highest
scoring alignment first. The k best alignments share no aligned pairs. SIM requires
space proportional to the sum of the input sequence lengths and the output alignment
lengths. Thus SIM can handle sequences of tens of thousands, or even hundreds of
thousands, of base pairs on a workstation.

Fhourstones

Fhourstones[Ser, FAQ] is a small integer-only program that solves positions in the
game of Connect-4 using exhaustive search with a very large transposition table. The
program is written in C.

Heapsort

Heapsort[Ser, FAQ] is an integer program that uses the 'heap sort' method of sorting
a random array of long integers up to 2 megabytes in size. The benchmark result is given
as a MIPS rating, based upon the program run time for one iteration. A gcc (GNU C
Compiler) 2.1 unoptimized assembly dump count of instructions per iteration for an
i486 machine is taken as the reference. A heapsort MIPS rating is not representative
of a 'typical' instruction mix.

Hanoi

Hanoi[Ser, FAQ] is an integer program that solves the 'Towers of Hanoi' puzzle using
recursive function calls.
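
The recursive solution that such a program exercises can be sketched in a few lines.
This is an illustration of the technique rather than the benchmark's own source; the
function name `hanoi` and the peg labels are hypothetical.

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Move n discs from source to target; return the list of (from, to) moves."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, source, spare, target, moves)   # clear the way
        moves.append((source, target))               # move the largest disc
        hanoi(n - 1, spare, target, source, moves)   # put the rest back on top
    return moves


if __name__ == "__main__":
    print(len(hanoi(10)), "moves for 10 discs")
```

Solving for n discs takes 2^n - 1 moves, so the run time is dominated by the cost of
procedure calls, which is what makes the puzzle attractive as an integer benchmark.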

2.2.3 Scientific Benchmarks 

Linpack

As described by its authors, Linpack[MBL91, Pri89, Wei91] didn't originate as a
benchmark; it was first just a collection (a package, hence the name) of linear algebra
subroutines often used in FORTRAN programs. It was first published in 1976 by
Jack Dongarra of the University of Tennessee. The program operates on a large
matrix (2-dimensional array); however, the inner subroutines manipulate the matrix
as a one-dimensional array. The matrix size is 100x100.

The results are usually reported in terms of MFLOPS; the number of floating-point
operations executed by the program can be derived from the array size. This
terminology implies that the non-floating point operations are neglected or, stated
otherwise, that their execution time is included in that of the floating-point operations.
Linpack has a high percentage of floating point operations. However, only a
few kinds of floating-point operations are actually used. For example, there are no
floating-point divisions in the program, and no mathematical functions are used at all.
The execution time is almost exclusively spent in one small function. This means that
even a small instruction cache will show a very high hit rate.
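
How an MFLOPS figure follows from the array size can be shown with a short
calculation. The sketch below assumes the conventional operation count for solving
an n x n linear system, 2n^3/3 + 2n^2; the text does not state which count is used.
The run time of 0.05 seconds is hypothetical, as are the function names.

```python
def linpack_flops(n):
    """Conventional operation count for solving an n x n system: 2n^3/3 + 2n^2."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def mflops(n, seconds):
    """MFLOPS rate implied by solving an n x n system in `seconds`."""
    return linpack_flops(n) / (seconds * 1e6)


if __name__ == "__main__":
    # Hypothetical run time of 0.05 s for the 100x100 case.
    print(f"{mflops(100, 0.05):.2f} MFLOPS")
```

Because the operation count is fixed by n, the reported rate depends only on the
measured time; any time spent on non-floating-point work lowers the figure.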

A Java version of Linpack is also available now[Ver].

Livermore Fortran Kernels

The 'Livermore Fortran Kernels'[MBL91], also called the 'Lawrence Livermore Loops',
consist of 24 'kernels', i.e. inner loops of numeric computations from different areas
of the physical sciences. The author, F.H. McMahon (Lawrence Livermore National
Laboratory, Livermore), has collected them into a benchmark suite and has added
statements for time measurement. The individual loops range from a few lines
to about one page of source code. The program is self-measuring and computes
MFLOPS rates for each kernel, for three different vector lengths. These kernels contain
a high amount of floating-point computations and a high percentage of array
accesses.

The performance metric reported for the Livermore Fortran Kernels is the MFLOPS
rate for each individual kernel along with the overall arithmetic, harmonic, and
geometric means for the entire suite. For vector machines, an all-scalar compilation run
is required to measure the basic scalar performance range of the processor.
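
The three summary figures can differ considerably for the same set of kernel rates.
The sketch below computes them for four hypothetical MFLOPS values; the function
names and the sample rates are illustrative and do not come from any measured run.

```python
import math


def arithmetic_mean(rates):
    return sum(rates) / len(rates)


def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)


def geometric_mean(rates):
    return math.exp(sum(math.log(r) for r in rates) / len(rates))


# Hypothetical MFLOPS rates for four kernels.
rates = [1.0, 2.0, 4.0, 8.0]

if __name__ == "__main__":
    print(f"arithmetic {arithmetic_mean(rates):.3f}")
    print(f"harmonic   {harmonic_mean(rates):.3f}")
    print(f"geometric  {geometric_mean(rates):.3f}")
```

The harmonic mean is dominated by the slowest kernels and the arithmetic mean by the
fastest, which is one reason for reporting all three alongside the per-kernel rates.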

NAS Kernel Benchmark

The NAS (Numerical Aerodynamic Simulation)[MBL91] Kernel Benchmark was developed
at the NASA Ames Research Center. It consists of seven test kernels.
Each individual test kernel consists of a loop that iteratively calls a particular
subroutine. These subroutines were supported by a number of NASA Ames scientists
and programmers involved with computational fluid dynamics projects. All seven
selected programs emphasize the vector performance of a computer system. In fact,
almost all of the floating-point operations are contained in loops that are computable
by vector operations. Input data arrays were generated using a portable pseudo
random number generator. Each of the kernels is independent of the others, i.e.
none depends on the results calculated in a previous test program.

EDN Benchmarks

The EDN Benchmarks[Wei91] were developed by a group at CMU for the project
'Military Computer Family'; they were published by EDN in 1981. Originally, the programs
were written in several assembly languages (LSI-11/23, 8086, 68000, Z8000);
the intention was to measure the speed of the microprocessor without also measuring
the compiler's quality. The benchmark programs include programs for string search,
bit test/set/reset, linked list insertion, quicksort, and bit matrix transformation.
A C version of the benchmarks is also available, but it is not standard and the
programs are disseminated in an informal way only.

SPICE

The general-purpose SPICE (Simulation Program with Integrated Circuit Emphasis)
[Pri89] came from the University of California at Berkeley. This benchmark makes
heavy use of both integer and double-precision floating-point calculations (the
floating-point operations are not vector oriented). Because it is quite large, the program
is a good test of system instruction and data-cache performance. SPICE accepts a circuit
description as input and simulates the design. The user can monitor currents and
voltages at various circuit locations. UC Berkeley and several system vendors have
distributed various input data packs for simulation of different types of circuits.

Stanford Integer and Floating Point Benchmarks

The Stanford Integer[Wei91, Pri89] benchmark is written in the C programming 
language. The suite consists of small programs that use algorithms to solve real- 
world problems. Some of these small programs include the Towers of Hanoi and the 
Eight Queens puzzles, multiplication of integer matrices, and the quick and bubble 
sorts. Yet another routine inserts and recursively searches a binary tree. Test 
measurements consist of the geometric mean of all results. 

The Stanford Floating Point[Wei91, Pri89] benchmark uses tight loops and a large
proportion of floating-point code. The routine's susceptibility to code optimization
by high-quality compilers affects the results. The suite consists of the fast Fourier
transform (FFT) and matrix multiplication (MM) tests. The first test typically
computes a 256-point, single-precision FFT 20 times. The second test multiplies
two 40x40 single-precision matrices.

Los Alamos Benchmarks

The Los Alamos National Laboratory (LANL) maintains a set of 13 portable FORTRAN-77
benchmark[MBL91] programs that represent the Laboratory workload. These
benchmarks are intended to represent the types of algorithms of current importance
to LANL as well as the characteristic style of coding found within actual production
codes run at the Laboratory. The LANL benchmark set consists of a hierarchy of
codes: hardware demonstration kernels, basic routines, and stripped down application
programs; the last of these come closest to representing the true LANL workload.

Two of the original benchmark programs measure rates in MFLOPS for primitive
vector operations as a function of vector length. The MFLOPS figures obtained with
these benchmarks normally overstate the computational power compared with the results
that can be obtained for a real application code.

Sieve of Eratosthenes

One of the most popular programs for benchmarking in the world of small PCs
is the 'Sieve of Eratosthenes'[Wei91], sometimes also called 'Primes'. It computes
all prime numbers up to a given limit (usually 8192). The program has some unusual
characteristics: 33% of the dynamically executed statements are assignments
of a constant, and only 5% are assignments with an expression on the right hand side.
There are no 'while' statements and no procedure calls; 50% of the statements are
loop control evaluations. All operands are integer operands, and 58% of them are local
variables.
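
The algorithm behind this benchmark can be sketched as follows. This is a plain
rendering of the sieve, not the benchmark source, so its statement mix will not
match the percentages quoted above; the function name `sieve` is hypothetical.

```python
import math


def sieve(limit=8192):
    """Return all primes up to and including `limit`."""
    is_prime = [True] * (limit + 1)
    is_prime[0] = False
    if limit >= 1:
        is_prime[1] = False
    for i in range(2, math.isqrt(limit) + 1):
        if is_prime[i]:
            # Strike out every multiple of i, starting at i*i.
            for multiple in range(i * i, limit + 1, i):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]


if __name__ == "__main__":
    print(len(sieve()), "primes up to 8192")
```

The inner loop does nothing but assign a constant to an array element, which agrees
with the observation that constant assignments and loop control dominate the run.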

Doduc

This is a 5,300 line FORTRAN program[Pri89], which simulates the operations
within a nuclear reactor. This program accurately tests instruction-fetch bandwidth
and scalar floating-point performance. Compilers can vectorize very little of the
code. The routine uses the Monte Carlo method of simulation, in which an iterative
process converges on an expected result. The routine was originally designed as a
check of both compiler and real-world functions. Normalized results appear in terms
of the ratio of CPU time needed to perform the test versus an arbitrarily defined
reference. This R factor is normalized so that 100 equals the performance of the IBM
370 Model 168. The expression is: R = 48,671 / seconds of processor time. Larger R
factors mean higher system performance.
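
The normalization can be checked with a short calculation. The sketch below only
restates the expression given above; the function name `r_factor` and the sample
run times are hypothetical.

```python
REFERENCE_CONSTANT = 48671.0  # from the expression R = 48,671 / seconds


def r_factor(cpu_seconds):
    """R factor for a run that needed `cpu_seconds` of processor time."""
    return REFERENCE_CONSTANT / cpu_seconds


if __name__ == "__main__":
    # A run time of 486.71 s corresponds to the reference value R = 100.
    for seconds in (486.71, 243.355, 48.671):
        print(f"{seconds:9.3f} s -> R = {r_factor(seconds):.1f}")
```

A machine that finishes in half the reference time therefore scores R = 200, so the
R factor scales linearly with speed relative to the reference machine.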

2.2.4 High Performance Scientific Computing

EuroBen Group Benchmark

The EuroBen[vdS91] group was established in mid-1990 by a group of people who
were concerned about obtaining the performance profile of high-performance scientific
computers. The founders of EuroBen believe that the performance of
high-performance scientific computers cannot be characterized by a single performance
measure, especially where vector and parallel architectures are involved. A
graded approach was therefore used to ensure a more general assessment of the performance.
The programs range from very simple ones to complete algorithms that are important
in certain application areas. The simple programs give basic information on the
performance of operations, intrinsic functions, etc., which can do much to identify the
strong and weak points of a machine. When one wants to extract more information,
one should conduct the next level of tests, which contains simple but frequently used
algorithms like FFTs, random number generation, etc. When one wants still more
information, one has to run more modules. All the modules are written in FORTRAN-77
and the precision required is at least 64 bits.

The Perfect Benchmark

The Perfect (Performance Evaluation for Cost-Effective Transformations)[MBL91]
benchmarking activity was started in 1987 by a group of academic and industrial
collaborators. The original ambitious goals were to define an applications-based
methodology for supercomputer performance evaluation and in the process assemble and
port a suite of scientifically relevant codes to numerous high performance computing
machines. The resulting set of codes, known as the Perfect benchmarks, consists of 13
programs drawn from a variety of scientific and engineering fields. With over 60,000
lines of FORTRAN source listing, several of these programs have been successfully
ported to over 30 machines.

Flops

Flops[FAQ, Ser] estimates the MFLOPS rating for specific floating-point addition
(FADD), floating-point subtraction (FSUB), floating-point multiplication (FMUL),
and floating-point division (FDIV) instruction mixes. Four distinct MFLOPS ratings
are provided based on FDIV weightings from 25% to 0% and using register-to-register
operations. This benchmark works with both scalar and vector machines.

STREAM

STREAM [FAQ] is a synthetic benchmark which measures sustainable memory band- 
width with and without simple arithmetic, based on the timing of long vector opera- 
tions. STREAM is available in FORTRAN and C versions, and the results are used 
by all major vendors in high performance computing. 
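
The principle of deriving a bandwidth figure from the timing of a long vector
operation can be sketched as follows. This is not STREAM: in an interpreted
language the loop overhead dominates, so the number printed says little about the
memory system. The kernel shown has the form a[i] = b[i] + q*c[i], and the function
names, the vector length and the value of q are hypothetical.

```python
import time
from array import array


def triad(a, b, c, q):
    """a[i] = b[i] + q * c[i]: a vector kernel with simple arithmetic."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bandwidth_mb_s(n=200_000, q=3.0):
    """Time one pass over three vectors of n doubles; return (MB/s, result)."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    triad(a, b, c, q)
    elapsed = max(time.perf_counter() - start, 1e-9)
    bytes_moved = 3 * n * a.itemsize  # read b, read c, write a
    return bytes_moved / (elapsed * 1e6), a


if __name__ == "__main__":
    rate, _ = triad_bandwidth_mb_s()
    print(f"{rate:.1f} MB/s (interpreter overhead included)")
```

The vectors must be much longer than any cache for the figure to reflect main
memory, which is why the benchmark is built around long vector operations.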

2.3 Disk I/O and NFS Performance Benchmarks 
2.3.1 IOBENCHP

IOBENCHP[FAQ] was written by Barry Wolman of Prime Computer. IOBENCHP
is a multi-stream benchmark that uses a controlling process (iobench) to start,
coordinate, and measure a number of 'user' processes (iouser).

There are four workloads associated with IOBENCHP: short, ref, test, and
elapsed. The 'short' workload takes only a few seconds to run and just verifies that
IOBENCHP was made properly. The two 2MB files used by the 'short' workload
are created automatically in the result directory. The 'ref' workload uses four 20MB
files with 16, 32, 48, 64, and 80 users using record lengths of 100 bytes, 4096 bytes,
and 6192 bytes. For each run, 10,000 'cycles' are spread across the number of users.
The 'ref' workload should take about 5-20 minutes to run depending on the number of
disk controllers and the type and number of disks used. The 'test' workload is the same
as the 'ref' workload, except that no mounting is done. The 'elapsed' workload uses the
same four 20MB files and uses 50,000 'cycles' spread across 50 users with a 4096
byte record length. The 'elapsed' workload will take at least 5 minutes. Both the 'ref'
and 'elapsed' workloads require that mountable file systems be specified; these will
be unmounted (if necessary) before the run, mounted, and then left unmounted at
the end of the run.

2.3.2 IOZONE

This test [FAQ] writes an X megabyte sequential file in Y byte chunks, then
rewinds it and reads it back. The size of the file should be big enough to
factor out the effect of any disk cache.

The file is written (filling any cache buffers), and then read. If the cache is
greater than X MB, then most if not all of the reads will be satisfied from the
cache. However, if it is less than or equal to half of X MB, then none of the
reads will be satisfied from the cache. This is because, after the file is
written, an X/2 MB cache will contain the upper half of the test file, but the
program will start reading from the beginning of the file (data which is no
longer in the cache). For this to be a fair test, the length of the test file
must be at least twice the amount of disk cache memory in the system. If not,
one is really testing the speed at which the CPU can read blocks out of the
cache, which is not a fair test.

IOZONE does not normally test the raw I/O speed of the disk or system. It tests
the speed of sequential I/O to actual files. Therefore, this measurement factors
in the efficiency of the file system, C compiler, and C runtime library. It
produces a measurement which is the number of bytes per second that the system
can read from or write to a file.
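
The write-then-read pattern and the cache sizing rule described above can be sketched in C as follows. This is a minimal sketch, not the IOZONE source: all function names are hypothetical, the file is reopened for reading instead of being rewound, and timing is left to the caller.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Smallest fair test-file size (MB) for a given disk cache size (MB). */
long min_test_file_mb(long cache_mb)
{
    return 2 * cache_mb;
}

/* Write `total` bytes to `path` in `chunk`-byte pieces.
   Returns the number of bytes written, or -1 on error. */
long seq_write(const char *path, long total, size_t chunk)
{
    FILE *fp;
    char *buf;
    long done = 0;

    if (chunk == 0 || (buf = malloc(chunk)) == NULL)
        return -1;
    if ((fp = fopen(path, "wb")) == NULL) {
        free(buf);
        return -1;
    }
    memset(buf, 'x', chunk);
    while (done < total) {
        size_t left = (size_t)(total - done);
        size_t want = left < chunk ? left : chunk;
        if (fwrite(buf, 1, want, fp) != want)
            break;
        done += (long)want;
    }
    free(buf);
    if (fclose(fp) != 0)
        return -1;
    return done;
}

/* Read `path` from the start in `chunk`-byte pieces.
   Returns the number of bytes read, or -1 on error. */
long seq_read(const char *path, size_t chunk)
{
    FILE *fp;
    char *buf;
    long done = 0;
    size_t got;

    if (chunk == 0 || (buf = malloc(chunk)) == NULL)
        return -1;
    if ((fp = fopen(path, "rb")) == NULL) {
        free(buf);
        return -1;
    }
    while ((got = fread(buf, 1, chunk, fp)) > 0)
        done += (long)got;
    free(buf);
    fclose(fp);
    return done;
}

/* Throughput in bytes per second for `bytes` moved in `seconds`. */
double bytes_per_second(long bytes, double seconds)
{
    return seconds > 0.0 ? (double)bytes / seconds : 0.0;
}
```

Because the sketch goes through the C stdio layer, its figures would include the overhead of the C runtime library and the file system, as noted above for IOZONE.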
2.3.3 NFS Stone 


NFS Stone [FAQ] takes two arguments. The first is the directory in which to run
the tests; the second is the lock file name. The program sets itself up, then
tries to obtain the lock. This allows several instances of the benchmark to
synchronize. When an instance obtains the lock, it releases it right away to
allow another instance to get it. This should happen fast enough that the
clients start within a second or two of each other. NFS Stone performs various
operations such as ‘make dir’, ‘remove dir’, ‘copy’, ‘move’, ‘delete dir’,
‘read file’, ‘write file’, ‘load large programs’, etc. Different weights are
assigned to different operations, and a term ‘NFSSTONE operations’ is derived
using a formula. The benchmark presents the performance as ‘NFSSTONEs per
second’.

2.3.4 NHFS Stone 

This benchmark [FAQ] is intended to measure the performance of file servers that
follow the NFS protocol. The work in this area continued within the LADDIS group
and finally within SPEC. The SPEC benchmark 097.LADDIS (SFS benchmark suite) is
intended to replace Nhfsstone; it is superior to Nhfsstone in several aspects
(multi-client capability, less client sensitivity).

Nhfsstone is used on an NFS client to generate an artificial load with a
particular mix of NFS operations. It reports the average response time of the
server in milliseconds per call and the load in calls per second. The program
adjusts its calling patterns based on the client’s kernel NFS statistics and the
elapsed time. Load can be generated over a given time or number of NFS calls.
Because it uses the kernel NFS statistics to monitor its progress, Nhfsstone
cannot be used to measure the performance of non-NFS filesystems. Since it is
measuring servers, it should be run on a client that will not limit the
generation of NFS requests. This means the client should have a fast CPU, lots
of memory, and a good Ethernet interface, and should not be used for anything
else during testing. Nhfsstone assumes that all NFS calls generated on the
client are going to a single server, and that all of the NFS load on that server
is due to this client. To make this assumption hold, both the client and the
server should be as quiescent as possible during tests. If the network is
heavily utilized, the delays due to collisions may hide any changes in server
performance. High error rates on either the client or the server can also cause
delays due to retransmissions of lost or damaged packets.

2.4 Network Performance Benchmarks 

2.4.1 Nettest 

Nettest [FAQ] is a network performance analysis tool developed at Cray. 

2.4.2 Netperf 

Netperf [Hom, Div96] is a benchmark that can be used to measure various aspects
of networking performance. Its primary focus is on bulk data transfer and
request/response performance using either TCP or UDP and the Berkeley Sockets
interface. There are optional tests available to measure the performance of DLPI
(Data Link Provider Interface), Unix Domain Sockets, the Fore ATM API and the
HP HiPPI LLA interface. The Netperf program can perform the following tests:

• TCP Stream Performance 

• XTI (X/Open Transport Interface) TCP Stream Performance 

• UDP Stream Performance 

• XTI UDP Stream Performance 

• DLPI Connection Oriented Stream Performance 

• DLPI Connectionless Stream 

• Unix Domain Stream Sockets 

• Unix Domain Datagram Sockets 

• Fore ATM API Stream 

• TCP Request/Response Performance 

• TCP Connect/Request/Response 

• XTI TCP Request/Response Performance 

• UDP Request/Response Performance 

• XTI UDP Request/Response Performance 

• DLPI Connection Oriented Request/Response Performance 

• DLPI Connectionless Request/Response Performance 

• Unix Domain Stream Socket Request/Response Performance 

• Unix Domain Datagram Socket Request/Response Performance 

• Fore ATM API Request/Response Performance 


2.5 Overall Performance Benchmarks 

2.5.1 SPEC Benchmarks 

SPEC (Standard Performance Evaluation Corporation) [Dix91, Org] is a new,
evolving standard in benchmarking. SPEC, a non-profit corporation, was founded
in 1988 by Apollo, HP, MIPS and Sun Microsystems. SPEC provides a suite of
different benchmarks meant to measure different aspects of system performance.

■ New CPU Benchmarks: SPEC95

These benchmarks measure the performance of the CPU, the memory system, and
compiler code generation. They normally use UNIX as the portability vehicle, but
they have been ported to other operating systems as well. The percentage of time
spent in OS and I/O functions is generally negligible.

CINT95: integer programs, representing the CPU-intensive part of system or
commercial application programs.

CFP95: floating point programs, representing the CPU-intensive part of numeric
scientific application programs.

■ Old CPU Benchmarks: SPEC92

CINT92: this suite contains six benchmarks performing integer computations, all
of them written in C.

CFP92: this suite contains 14 benchmarks performing floating point computations,
12 of them written in FORTRAN and two in C.

■ SDM Benchmark Suite

SDM stands for ‘System Development Multitasking’; the benchmarks in this suite
characterize the capacity of a system in a multiuser environment.

■ SFS Benchmark Suite

SFS stands for ‘System Level File Server’; this suite is designed to provide a
fair, consistent, and complete method for measuring and reporting NFS
performance. SFS Release 1.1 contains one benchmark, 097.LADDIS. This benchmark
measures NFS file server performance in terms of NFS response time and
throughput. It does this by generating a synthetic NFS workload based on a
workload abstraction of an NFS operation mix and an NFS operation request rate.

Running 097.LADDIS requires a file server (the entity being measured) and two
or more ‘load generators’ connected to the file server. The load generators are
each loaded with 097.LADDIS and perform the 097.LADDIS workload on file systems
exported by the file server. The SFS Steering Committee is working on a new
version of the SFS benchmark suite. Its main features will be support for NFS
protocol version 3 and a modified workload.

■ SPEC hpc96 Benchmark Suite 

At the Supercomputing ’95 conference (Dec ’95), the SPEC High Performance
Computing Group (HPCG) announced the SPEC hpc96 benchmark suite, consisting of
two benchmarks:

SPEC Seis96: an industrial application based on modern seismic processing
programs used in the search for oil and gas.

SPEC Chem96: an improved version of the program called GAMESS, which comes from
the US Department of Energy's National Resource for Computing in Chemistry.

■ Forthcoming SPEC Benchmarks

• SPECcpu98 - Two important issues, among others, are that the programs should
be such that they can be made compute bound and that they can be made portable
across different hardware architectures and operating systems. The emphasis is
again going to be on the system’s processor, memory hierarchy and compiler.

• SPECsmt97 - The goal of this benchmark is to provide a measure of a system's
multitasking capabilities. This benchmark is currently being created. Work is
also being done on developing a standardized set of development tools (e.g. GCC,
Perl, GhostScript, etc.) to ensure that each system tested actually performs the
very same amount of work (and is not dependent upon the varying degrees of bells
and whistles that a vendor installs).

• SPECweb97 - This benchmark will measure the performance of computers (both
hardware and software) that are used as WWW servers. As in the SFS benchmark,
the benchmark code runs on one or more load generators (clients) that generate
HTTP requests over a network (LAN). New issues under consideration are dynamic
content, support for Keep-Alive, and multiple workloads.

2.5.2 Khornerstone 

Developed by Workstation Laboratories, this benchmark [Pri89] yields a
normalized rating of overall system performance using 22 separate tests. This
suite of tests includes a mix of both public-domain and proprietary benchmark
routines. The result is a unit of measure called Khornerstones per second. This
set of routines measures characteristics of processor, floating-point, and disk
performance. The Khornerstone test measures single-user loads on a system and
therefore does not accurately measure multiuser performance.

2.5.3 AIM 

AIM Technology of Palo Alto, CA [Pri89] sells and maintains two suites of
multiuser benchmark tests.

Suite III: Written in the C programming language, this suite simulates
applications that fall into either the task-specific or the device-specific
category. The task-specific routines simulate functions such as word processing,
database management, and accounting. The device-specific code measures the
performance of hardware features like memory, disk, floating-point, and I/O
operations. All measurements represent a percentage of VAX 11/780 performance.
In general, AIM Suite III gives an overall performance indication.

Suite V: This suite measures throughput in a multitasking workstation
environment. The design goals of this new suite include incremental system
loading to gradually increase the stress on system resources, and testing
multiple aspects of system performance. The graphically displayed results plot
the workload level versus time. Several different models characterize various
user environments (financial, publishing, software engineering). The published
reports are copyrighted.

2.5.4 SysMark 

SYSmark93 [FAQ] for DOS and Windows was introduced by the Business Applications
Performance Corp. (BAPCo), Santa Clara, CA, in 1993. This benchmark software
provides objective performance measurement based on the world's most popular PC
applications and operating systems.

SYSmark93 provides benchmarks that can be used to objectively measure the
performance of IBM PC-compatible hardware for the tasks users perform on a
regular basis. The benchmarks are comparative tools for those who make
purchasing decisions for anywhere from 10 to a thousand or more PCs. SYSmark93
has been endorsed by the BAPCo membership, which includes the world’s leading
PC hardware and software vendors, chip manufacturers, and industry publications.

SYSmark93 benchmarks represent the workloads of popular programs in such
applications as word processing, spreadsheets, database, desktop graphics and
software development. Benchmarking can be conducted on the user’s own system or
at a vendor's site using the standards set by BAPCo to ensure consistency of the
results.

2.5.5 Byte Unix Benchmarks 

This is a benchmark suite [FAQ] similar in spirit to SPEC, except that it is
smaller and contains mostly things like ‘sieve’ and ‘dhrystone’. If one is
comparing different Unix machines for performance, this gives fairly good
numbers. The suite includes programs for measuring arithmetic overhead, system
call overhead, filesystem throughput, pipe throughput, etc.

2.6 Miscellaneous Benchmarks 
2.6.1 TPC 

The TPC is a non-profit corporation founded to define transaction processing and 
database benchmarks and to disseminate objective, verifiable TPC performance data 
to the industry. TPC has defined the following benchmarks [TPC, JG91, FAQ]: 

■ TPC-A

TPC-A is a standardization of the Debit/Credit benchmark, which was first
published in DATAMATION in 1985. It is based on a single, simple,
update-intensive transaction which performs three updates and one insert across
four tables. Transactions originate from terminals, with a requirement of 100
bytes in and 200 bytes out. There is a fixed scaling between tps rate,
terminals, and database size. TPC-A requires an external RTE (remote terminal
emulator) to drive the SUT (system under test).
■ TPC-B 


TPC-B uses the same transaction profile and database schema as TPC-A, but elim- 
inates the terminals and reduces the amount of disk capacity which must be priced 
with the system. TPC-B is significantly easier to run because an RTE is not required. 


■ TPC-C

TPC-C is completely unrelated to either TPC-A or TPC-B. TPC-C tries to model
a moderate to complex OLTP system. The benchmark is conceptually based on an
order entry system. The database consists of nine tables which contain
information on customers, warehouses, districts, orders, items, and stock. The
system performs five kinds of transactions: entering a new order, delivering
orders, posting customer payments, retrieving a customer’s most recent order,
and monitoring the inventory level of recently ordered items. Transactions are
submitted from terminals providing a full screen user interface. (The
specification defines the exact layout for each transaction.) TPC-C was
specifically designed to address many of the shortcomings of TPC-A. It exercises
a much broader cross-section of database functionality than TPC-A. Also, the
implementation rules are much stricter in critical areas such as database
transparency and transaction isolation. Overall, TPC-C results will be a much
better indicator of RDBMS and OLTP system performance than previous TPC
benchmarks.

■ TPC-D

The TPC-D suite is a decision-support benchmark suite (OLTP env.). It performs
a database stress test, a simulation of a large database, and complex query
tests.

2.6.2 Hartstone 

Hartstone is a benchmark for measuring various aspects of real-time systems. This 
benchmark was designed at the Software Engineering Institute of Carnegie Mellon 
University. 
2.6.3 PCBench/WinBench/NetBench/MacBench/ServerBench 

PC Bench 9.0, WinBench 95 Version 1.0, Winstone 95 Version 1.0, MacBench 2.0,
NetBench 3.01, and ServerBench 2.0 [FAQ] are the current names and versions of
the benchmarks available from the Ziff-Davis Benchmark Operation (ZDBOp). These
benchmarks are used to measure the overall performance of DOS PCs, Windows PCs,
Macintosh systems, etc.

2.7 Summary 

Table 1 presents a summary of different benchmarks. An attempt is made to
categorize the popular benchmarks on the basis of several characteristics. The
characteristics are synthetic vs. application, and type of workload (integer
arithmetic, floating point arithmetic, scientific computing, supercomputing,
disk I/O, NFS, etc.). The table also includes the answers to several questions,
such as whether or not the benchmark is CPU architecture sensitive, compiler
sensitive, optimization sensitive, instruction cache sensitive, and data cache
sensitive.

We conclude from the study of popular benchmarks that a generic benchmark rating
is like the mileage rating of an automobile. Such ratings guarantee only that
the product will never exceed the quoted performance. Just as automobile vendors
are required to include the line “Your actual mileage may vary according to road
conditions and driving habits” with the performance rating, something similar
should be included with the benchmark ratings of computer systems.

Benchmarking programs should run under real-life load for a long duration. The
different instances of execution time (or some rating) for the same system load
can be averaged. This helps users to get a measure of system performance for
varying system load. The next chapter presents the design of the performance
evaluation tool, which measures the load on different components of a system,
runs several benchmarking programs designed to generate different types of
workload, and presents the measure of system performance with the current system
load.
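
Averaging execution times for the same system load can be sketched in C as follows. This is a hypothetical illustration, not the tool's actual analysis code: the bucket count, the use of the load average as the only load measure, and all names are assumptions.

```c
#define LOAD_BUCKETS 8   /* assumed: bucket k covers load averages in [k, k+1) */

struct load_profile {
    double total_time[LOAD_BUCKETS];  /* summed execution time per bucket */
    int    runs[LOAD_BUCKETS];        /* number of runs per bucket        */
};

/* Map a load average to its bucket; the last bucket is open-ended. */
static int bucket_of(double load)
{
    if (load < 0.0)
        return 0;
    if (load >= LOAD_BUCKETS)
        return LOAD_BUCKETS - 1;
    return (int)load;
}

/* Record one test run that took `seconds` under load average `load`. */
void profile_add(struct load_profile *p, double load, double seconds)
{
    int k = bucket_of(load);
    p->total_time[k] += seconds;
    p->runs[k]++;
}

/* Mean execution time for the bucket containing `load`,
   or -1.0 if no run has been recorded for that bucket. */
double profile_mean(const struct load_profile *p, double load)
{
    int k = bucket_of(load);
    return p->runs[k] ? p->total_time[k] / p->runs[k] : -1.0;
}
```

A run over a long period would call `profile_add` after each test, and the per-bucket means would then show how execution time varies with load.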
Benchmark Name             Type    Workload                      AS  CS  OS  IChS  DChS
Whetstone                  Synth.  Floating point                N   Y   Y   N     N
Dhrystone                  Synth.  System programming            Y   Y   N   Y     Y
Digital Review             Synth.  Integer and Floating point    N   N   Y   Y     Y
Sim                        Appli.  Integer                       Y   N   N   N     Y
Fhourstones                Appli.  Integer                       N   Y   Y   N     N
HeapSort                   Appli.  Integer                       Y   Y   Y   N     Y
Hanoi                      Appli.  Integer                       Y   N   N   N     N
Linpack                    Synth.  Floating point                N   Y   Y   N     Y
Livermore Fortran Kernels  Appli.  Floating point                N   Y   Y   Y     Y
NAS Kernel Benchmark       Appli.  Supercomputing                Y   Y   Y   Y     Y
ED.X Benchmarks            Appli.  A mix of CPU bound instr.     N   Y   Y   Y     Y
Stanford Int. and Float.   Synth.  Integer and Floating point    Y   N   Y   Y     Y
Los Alamos Benchmarks      Appli.  Supercomputing                Y   Y   Y   Y     Y
Doduc                      Appli.  Scientific computing          Y   Y   Y   N     Y
Euroben Benchmarks         Synth.  Scientific (multiprocessors)  Y   Y   Y   N     Y
Perfect Benchmark          Appli.  Supercomputing                Y   Y   Y   N     Y
Flops                      Synth.  Scientific computing          Y   Y   N   N     N
Stream                     Synth.  Scientific computing          Y   Y   N   N     Y
IOBENCH                    Synth.  Disk I/O                      -   -   -   -     -
IOZONE                     Synth.  Disk I/O                      -   -   -   -     -
NFS Stone                  Synth.  NFS I/O                       -   -   -   -     -
NHFS Stone                 Synth.  NFS I/O                       -   -   -   -     -
Netperf                    Synth.  Network                       -   -   -   -     -
SPEC                       Synth.  CPU Int/Float, Disk, NFS      N   N   Y   -     -
Khornerstone               Synth.  CPU Int/Float, Disk I/O       Y   Y   Y   -     -
AIM                        Synth.  CPU, Memory, Disk, Database   Y   Y   Y   -     -
SYSMARK                    Synth.  DOS/Windows Applications      -   Y   Y   -     -
Byte Unix Benchmarks       Synth.  Unix sys calls, CPU, disk     -   Y   Y   -     -
TPC                        Synth.  Database and Tran. Process    -   Y   Y   -     -

AS - CPU Architecture Sensitive (RISC/CISC), CS - Compiler Sensitive, OS -
Optimization Sensitive, IChS - Instruction Cache Sensitive, DChS - Data Cache
Sensitive. A ‘-’ entry means that no value is given.

Table 1: A summary of the benchmarks 
Chapter 3 
Design of the Performance Evaluation Tool 


The Performance Evaluation Tool records the different components of current
system load, runs a series of test programs, measures the time taken by each
test, and reports the results. Unlike most other benchmarks, this tool does not
provide a single number to represent the system performance. Instead, it prints
the time taken by the different tests along with the components of system load
which might affect the execution time of each test.


3.1 Scope of the Performance Evaluation Bench 

The test suite is designed to measure the performance of the most utilized
components of a system. The current scope of this tool is limited to measuring
the performance of CPU, disk I/O, NFS, and network. The tool contains
synthesized benchmark programs designed to approximate the workload of the IITK
environment. However, this tool can be used in other environments by giving
different weights to the test programs and by including some more test programs
in the test suite. The tool is designed to work on Unix (or Unix clone)
platforms, but it can be ported to other platforms by modifying the timing
routines and the programs that capture the system load.
3.2 Collecting System Parameters and Current Load 

The system parameters that are significant in comparing two systems are CPU
type and word size, number of processors (in multiprocessor systems), CPU clock
frequency, cache size, caching policy, size of physical memory, number of hard
disks, disk controller technology, DMA, type and bus width of the network
adapter, maximum speed of the underlying network, and name and release of the
operating system. Unfortunately, not all of these parameters can be obtained
automatically. Some systems store the values of a few parameters, such as the
number of clock ticks in a second, in header files, and some other parameters
can be obtained through the system call interface. For example, the OS name,
release, and CPU name can be obtained by the uname system call. Linux provides
some information about the CPU, devices, DMA, and memory through the proc file
system¹. Some parameters which are transparent to the kernel, such as cache and
network speed, cannot be obtained using the OS services. The only way to obtain
these parameters is to enter them manually.
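
As an illustration of the system call interface mentioned above, the following minimal sketch uses the POSIX `uname` call to obtain the OS name, release, and hardware name. The wrapper function `describe_system` and its output format are hypothetical and are not part of the tool.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Format "<OS name> <release> (<hardware name>)" into `out`.
   Returns 0 on success, -1 if uname() fails. */
int describe_system(char *out, size_t outlen)
{
    struct utsname u;

    if (uname(&u) < 0)
        return -1;
    snprintf(out, outlen, "%s %s (%s)", u.sysname, u.release, u.machine);
    return 0;
}
```

The `machine` field identifies the hardware type only; clock frequency and cache sizes are not available through this call.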

System load can be obtained through the services and interfaces provided by the
operating system. Major sources of this information are the proc file system,
the /dev/kmem file², the ps program, the uptime program, and the w program.
Since reading the proc file system and the /dev/kmem file requires root access
on the system, the utility programs (ps, uptime, w, etc.) remain the only source
of information for a normal user.

The following attributes of system load are available:

• Number of active login sessions.

• Load average (number of jobs in the run queue for the last 5 seconds, the last
30 seconds, and the last 60 seconds).

• Number of processes running on the system.

• Detailed information about each process.

¹ proc is a pseudo-filesystem which is used as an interface to the kernel data
structures. It is normally mounted on the /proc directory.

² /dev/kmem is a character device file that is an image of the kernel virtual
memory. Byte addresses in kmem are equivalent to mapped kernel memory addresses.

The following information about each process is significant in measuring system load: 

• Flag: This field tells if the process is in core, is being traced, is doing physical 
I/O, etc. 

• State: Symbolic process state such as Sleeping, Idle, Zombie, Running or 
Ready. 

• WCHAN: Address of the event on which a process is waiting. This can be 
used to know whether a process is waiting on disk I/O, or on Network I/O, or 
on tty input, or waiting for the termination of a child process. 

• Uid: User id is used to check whether the process is a root or non root process. 
Root processes are normally daemon processes which run all the time. 

• VSIZE: Process's virtual address size. 

• STIME: Start time and date of process. 

• USERTIME: CPU time used in user space. 

• SYSTIME: CPU time used in system space. 

• Sleep Time: Time spent by the process sleeping. 

• PCPU: Percentage CPU usage. 

• PMEM: Percentage real memory usage. 

• MAJFLT: Number of major page faults. 

• MINFLT: Number of minor page faults. 

• BLKIN: Block input operations. 

• BLKOUT: Block output operations. 

• IVCSW: Involuntary context switches. 

Not all systems provide all the attributes listed above. 


The following structure definition is used to store the number of login sessions,
load average, and uptime information.

    struct uptime
    {
        int   days;     /* Number of days,                         */
        int   hours;    /* hours,                                  */
        int   minutes;  /* and minutes since the system was booted */
        int   users;    /* No. of users (login sessions)           */
        float load1;    /* load average (last 5 seconds)           */
        float load2;    /* load average (last 30 seconds)          */
        float load3;    /* load average (last 60 seconds)          */
    };

The uptime program provides the values for all fields of the uptime structure. 
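
Filling this structure from one line of `uptime` output can be sketched in C as follows. The line format shown in the comment is an assumption (it differs between platforms, which is why the tool keeps such routines system specific), and the function name `parse_uptime` is hypothetical.

```c
#include <stdio.h>
#include <string.h>

struct uptime {
    int   days, hours, minutes, users;
    float load1, load2, load3;
};

/* Parse one line in the assumed format
 *   " 10:23am  up 12 days,  3:45,  4 users,  load average: 0.10, 0.20, 0.30"
 * Returns 0 on success, -1 if the line does not match. Other variants
 * (for example "1 day" or an uptime of less than one day) are not handled. */
int parse_uptime(const char *line, struct uptime *u)
{
    const char *up = strstr(line, " up ");
    const char *la = strstr(line, "load average:");

    if (up == NULL || la == NULL)
        return -1;
    if (sscanf(up, " up %d days, %d:%d, %d users",
               &u->days, &u->hours, &u->minutes, &u->users) != 4)
        return -1;
    if (sscanf(la, "load average: %f, %f, %f",
               &u->load1, &u->load2, &u->load3) != 3)
        return -1;
    return 0;
}
```

A system-specific routine could obtain the line by running the `uptime` program and reading its output, then hand the filled structure to the system-independent code.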
The output of utility programs and the interface of the proc file system are not
identical on different platforms. This tool contains a collection of routines
for each system that can be used to record the components of system load. Other
routines, such as those used for analysing system load and running test
programs, are the same for all systems. A well-defined interface is provided for
passing information between the system-specific and system-independent routines.
The following structure definition is used to store the attributes of a process.

    struct processes_info {
        int  uid;          /* User id                     */
        char state[4];     /* Symbolic state              */
        int  cputime;      /* time in seconds x 100       */
        int  usertime;     /* time in seconds x 100       */
        int  systime;      /* time in seconds x 100       */
        int  starttime;    /* start time and date         */
        int  flag;         /* Flag                        */
        int  majflt;       /* No. of major page faults    */
        int  minflt;       /* No. of minor page faults    */
        int  inblock;      /* Block input operations      */
        int  outblock;     /* Block output operations     */
        int  pcpu;         /* 100 x %CPU                  */
        int  pmem;         /* 100 x %MEM                  */
        int  vsize;        /* Size of virtual address     */
        int  rsize;        /* Real memory size            */
        char wchan[4];     /* Add. of the event for wait  */
        char command[8];   /* Command line                */
        int  etime;        /* Elapsed time since started  */
        struct processes_info *next;
    };


System-specific routines pass a linked list to the system-independent routines.
The above structure is the definition of one node of the linked list.
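
How a system-independent routine might walk such a list can be sketched as follows. The node is reduced to the fields the sketch needs, and both functions are hypothetical examples of load analysis rather than the tool's own routines.

```c
#include <stddef.h>

/* Reduced node: only the fields used in this sketch. */
struct processes_info {
    int  uid;                      /* User id        */
    char state[4];                 /* Symbolic state */
    int  pcpu;                     /* 100 x %CPU     */
    struct processes_info *next;
};

/* Count the nodes whose symbolic state starts with `code` (e.g. 'R'). */
int count_in_state(const struct processes_info *head, char code)
{
    int n = 0;

    for (const struct processes_info *p = head; p != NULL; p = p->next)
        if (p->state[0] == code)
            n++;
    return n;
}

/* Sum of CPU usage (in units of 0.01 %) over all non-root processes. */
int user_cpu_load(const struct processes_info *head)
{
    int total = 0;

    for (const struct processes_info *p = head; p != NULL; p = p->next)
        if (p->uid != 0)
            total += p->pcpu;
    return total;
}
```

Because the list hides how the attributes were collected, the same functions would work unchanged on every platform.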

No single common interface is available to get information about processes. The
following sections provide a brief overview of the techniques used on different
systems to collect the process attributes.

3.2.1 DEC OSF/1 and Digital Unix

The kernel exposes process information through the proc file system. For each
running or zombie process, there is an entry in the system process table, which
appears as a file name in the /proc directory. The file name is the decimal
representation of the process id. The ‘ioctl’ system call is used to get
meaningful information from a file in the proc file system. However, read
permission for the files in the proc file system is restricted to the owner of
the process. Therefore, only a root process can read all the files.

The ‘ps’ program in DEC OSF/1 is a suid (set-user-ID) program, and it provides
all the information needed about the processes. The ‘ps’ program takes the names
of the fields to be listed as command line parameters with the -o option.
3.2.2 SunOS 5 


SunOS provides a proc file system similar to that of DEC OSF/1, but the ‘ps’
program is not as powerful as on DEC OSF/1. Of the fields that this tool uses,
it lists only the flag, state, uid, rsize, wchan, stime, cputime, and command
fields.

3.2.3 SunOS (Solaris version) 

The Solaris version of SunOS has the same proc file system interface as SunOS
5. However, the ‘ps’ program is more powerful than that of SunOS 5. Some more
fields, such as pcpu, pmem, vsize, and etime, are listed. As in the case of
DEC OSF/1, the output of ‘ps’ can be restricted to selected fields.

3.2.4 HP-UX 

HP-UX does not provide the proc file system. Process information is obtained
from the /dev/kmem file using the ‘ioctl’ system call. The ‘ps’ program provides
only the flag, state, uid, rsize, wchan, stime, cputime, and command fields.

3.2.5 IRIX64 

IRIX64 (on SGI machines) provides a proc file system similar to that of DEC
OSF/1. The ‘ps’ program allows the -o option to specify the field names on the
command line. Some fields, such as ‘majflt’, ‘minflt’, ‘inblock’, ‘outblock’,
etc., are not available in the output of ‘ps’.

3.2.6 Linux 

Linux has an exhaustive proc filesystem. There is a subdirectory for each running 
process instead of a single file as in other systems. The subdirectory is named by the 
process ID. Each subdirectory contains the following pseudo-files and directories. 

• cmdline: This holds the complete command line for the process, unless the 
whole process has been swapped out or it is a zombie. 

• cwd: This is a link to the current working directory of the process. 

• environ: This file contains the environment for the process. 

• exe: A pointer to the binary which was executed. This appears as a symbolic 
link. 

• fd: This is a subdirectory containing one entry for each open file, named by 
its file descriptor. It is a symbolic link to the actual file. 

• maps: A file containing the currently mapped memory regions and their access 
permissions. 

• mem: This is not the same as the /dev/mem device, despite the fact that it 
has the same device numbers. The /dev/mem device is the physical memory, 
while the mem file here is the virtual memory of the process. 

• root: Root points to the file system root, set by the chroot system call. 

• stat: Status information about the process. This is used by the ‘ps’ program. 


Some other files are present in /proc directory which are not specific to a process 
but are global to the system. These files are: 

• cpuinfo: This is a collection of CPU and system architecture dependent items. 

• devices: Text listing of major numbers and device groups. 

• dma: This is a. list of the registered ISA DMA (direct memory access) channels 
in use. 

• filesystems: A text listing of the filesystems which are supported by the 
kernel. This is used by ‘mount’ program to cycle through different filesystems 
when none is specified. 

• interrupts: This is used to record the number of interrupts per IRQ on the 
i386 architecture. 
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• ioports: This is a list of currently registered Input-Output port regions that 
are in use. 

• kcore: This file represents the physical memory of the system and is stored in 
the core file format. 

• kmsg; This file can be used instead of the sysiog system call to read kernel 
messages. 

• ksyms: This holds the kernel exported symbol definitions used by the mod- 
ules tool to dynamically link and bind loadable modules. 

• net: This is a subdirectory which contains various net pseudo-files. Each one 
gives the status information of a part of the networking subsystem. 

• pci: This is a listing of all PCI devices found during kernel initialization and 
their configuration. 

• scsi: A directory which contains a file for each SCSI host in the system. 

• self: This directory refers to the process accessing the /proc filesystem, and 
is identical to the numerically named subdirectory for the process. 

• stat: Kernel/system statistics. This file contains the statistics of cpu, disk, 
pages, swaps, interrupts, context switches, and boot time. 

• uptime: This file contains two numbers: the uptime of the system (seconds), 
and the amount of time spent in idle process (seconds). 

• version: This file contains the string which identifies the kernel version that 
is currently running. 
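
As an illustration of how a load-measuring routine might read one of these files, the following is a minimal sketch that parses the two numbers of /proc/uptime described above. The function names are hypothetical and are not taken from the tool; the sketch assumes a Linux-style /proc filesystem.

```python
def parse_uptime(text):
    """Parse the contents of /proc/uptime: '<uptime> <idle>', both in seconds."""
    uptime_s, idle_s = (float(field) for field in text.split()[:2])
    return uptime_s, idle_s


def idle_fraction(text):
    """Idle time divided by uptime.

    On multiprocessor kernels the idle figure may be summed over all CPUs,
    so the value can exceed 1.0 there.
    """
    uptime_s, idle_s = parse_uptime(text)
    return idle_s / uptime_s if uptime_s > 0 else 0.0


if __name__ == "__main__":
    try:
        with open("/proc/uptime") as f:
            print("idle fraction since boot: %.3f" % idle_fraction(f.read()))
    except OSError:
        print("/proc/uptime is not available on this system")
```

Keeping the parsing separate from the file access makes it possible to check the routine against a recorded sample of the file on systems without /proc.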

3.3 Measuring the Network Load 

Network load relates to the whole LAN and not to a particular system. One server 
runs on each LAN to measure the network traffic on it. The client program connects 
to the server (the IP address of the machine and the TCP port of the server are known 
to the client) and gets the total number of packets and the average packet size during 
the last 'n' minutes. The server uses the pcap (packet capture) library. The authors of 
this library are Van Jacobson, Craig Leres, and Steven McCanne of the Lawrence Berkeley 
National Laboratory, UC Berkeley. This library provides a high-level interface to 
packet capturing systems. All packets on the network, even those destined for other 
hosts, are accessible through this mechanism. The server runs with root privilege and 
the network interface is put into promiscuous mode. The design of a network monitoring 
system for the IITK network is discussed in [Rao97]. 
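
The bookkeeping behind "total number of packets and average packet size during the last 'n' minutes" can be sketched as a sliding window. This is only an illustration under stated assumptions: the class name PacketWindow is hypothetical, the packet capture itself (done with pcap in the tool) is left out, and the caller is assumed to supply a timestamp and a size for every captured packet.

```python
from collections import deque


class PacketWindow:
    """Keeps (timestamp, size) records for the last `window_s` seconds."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.records = deque()
        self.total_bytes = 0

    def add(self, timestamp, size):
        """Record one captured packet."""
        self.records.append((timestamp, size))
        self.total_bytes += size
        self._expire(timestamp)

    def _expire(self, now):
        # Drop packets that have fallen out of the window.
        while self.records and self.records[0][0] <= now - self.window_s:
            _, old_size = self.records.popleft()
            self.total_bytes -= old_size

    def stats(self, now):
        """Return (packet count, average packet size) for the window ending at `now`."""
        self._expire(now)
        count = len(self.records)
        average = self.total_bytes / count if count else 0.0
        return count, average
```

A server built this way would call add() from its capture loop and answer each client request with stats() for the current time.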
Table 2: Accessing Parameters in Different Systems 


3.4 Measuring the System Performance 

This tool includes several test programs designed to synthesize the workload of the IITK 
computing environment. As discussed in the first chapter, the workload of IITK is 
a mix of interactive login sessions and batch jobs. Disk access is a mix of local disk 
I/O (most of the users prefer to work on the system hosting their home directory) 
and NFS I/O (working on workstations requires access to the user's files on the 
server's disk). At a higher level, the network load consists of the TCP load due 
to remote login sessions, X traffic, RPC (mainly due to NFS), and HTTP. File 
transfer using ftp is not very frequent, but it takes considerable bandwidth. 

The following sections present an overview of the test programs used by this tool. 

3.4.1 Integer Only Test 

A purely integer workload is not found in real-life applications, but the 'integer only' 
performance of a system directly measures the power of its CPU in integer arithmetic. 
This test includes a mix of addition, subtraction, multiplication, division, assignment, 
increment, and conditional jump instructions. A function containing arithmetic operations 
is called in a loop. The loop runs a fixed number of times. 
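
The structure of such a test (a function with an integer operation mix, called a fixed number of times and timed) can be sketched as follows. This is not the tool's test program: the particular operation mix, the function names, and the modulus are assumptions made for illustration, and in an interpreted language the interpreter overhead dominates the arithmetic itself.

```python
import time


def integer_mix(a, b):
    """A mix of integer add, multiply, subtract, divide, branch and increment."""
    total = a + b
    total = total * 3
    total = total - b
    total = total // (b + 1)      # b >= 0 is assumed, so the divisor is never zero
    if total % 2 == 0:            # conditional jump
        total += 1                # increment
    else:
        total -= 1
    return total


def run_integer_test(iterations):
    """Call integer_mix a fixed number of times; return (checksum, seconds)."""
    start = time.perf_counter()
    checksum = 0
    for i in range(iterations):
        checksum = (checksum + integer_mix(i, i % 7)) % 1000003
    return checksum, time.perf_counter() - start
```

The checksum is returned so that the work cannot be skipped and so that runs on different systems can be checked for identical results.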

3.4.2 Floating Point Arithmetic Tests 

This test is not actually a pure floating point test but a mix of several numerical 
operations. It includes arithmetic operations on array elements, conditional jumps, 
integer arithmetic, trigonometric functions (atan, sin, cos), and standard functions 
(sqrt, exp, log). The idea of various floating point operations is taken from Whetstone. 

3.4.3 Fast Fourier Transform 

This suite includes routines to compute the Hartley transform, Fourier transform, 
inverse Fourier transform, real-valued Fourier transform, and inverse of the 
real-valued Fourier transform of 'n' points. This suite is a good representative of 
CPU-bound jobs with mixed types of instructions. The code was written by Ron Mayer 
of Acuson. 

3.4.4 Matrix Multiplication 

Matrix multiplication is a common operation in several scientific programs in engineering, 
mathematics, and the sciences. This program performs the matrix multiplication 
of large matrices. The program is sensitive to the paging scheme used in the 
virtual memory system and to the page size. Also, the distance of the first matrix element 
from the page boundary is critical for the performance. 
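
The sensitivity to memory layout comes from the order in which the three nested loops touch the matrices. A minimal sketch, assuming matrices stored as lists of rows (the function name and loop order are illustrative choices, not the tool's code):

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows (row-major layout)."""
    n, m, p = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        row_a, row_c = a[i], c[i]
        for k in range(m):
            # The i-k-j order walks b and c one row at a time, so consecutive
            # accesses fall on neighbouring elements of the same row.
            a_ik = row_a[k]
            row_b = b[k]
            for j in range(p):
                row_c[j] += a_ik * row_b[j]
    return c
```

With the more familiar i-j-k order the inner loop strides down a column of b instead, which for large row-major matrices touches a different page on almost every access.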

3.4.5 N-Queens Problem 

This program finds all the possible ways for 'N' queens to be placed on an NxN 
chessboard such that they do not capture one another, that is, so that no rank, file, 
or diagonal is occupied by more than one queen. This program is a good example 
of recursion. The program synthesizes a typical scientific program that makes use 
of pointers, complex data structures, and recursion. 
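
The recursive search can be sketched as follows. This is a standard backtracking formulation written for illustration, not the tool's program; it counts the placements instead of printing them, and the function name is hypothetical.

```python
def count_queens(n):
    """Count the placements of n non-attacking queens on an n x n board."""
    columns, diag_down, diag_up = set(), set(), set()

    def place(row):
        if row == n:                      # every rank holds one queen
            return 1
        found = 0
        for col in range(n):
            if col in columns or (row - col) in diag_down or (row + col) in diag_up:
                continue                  # file or diagonal already occupied
            columns.add(col)
            diag_down.add(row - col)
            diag_up.add(row + col)
            found += place(row + 1)       # recurse into the next rank
            columns.remove(col)
            diag_down.remove(row - col)
            diag_up.remove(row + col)
        return found

    return place(0)
```

One queen is placed per rank, so only the files and the two diagonal directions need to be tracked.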

3.4.6 Unix System Calls Overhead Test 

This suite includes various routines to synthesize the general Unix system call workload. 
The routines use fork, exec, pipe, dup, open, close, and some other system 
calls. Some short-lived processes (such as 'ls') are created periodically. 
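
A rough sketch of such a workload is given below. It is an assumption-laden illustration and not the tool's suite: the function names are hypothetical, the child process runs the Python interpreter instead of 'ls' so that the sketch does not depend on a particular command being installed, and the mix of calls is arbitrary.

```python
import os
import subprocess
import sys
import time


def syscall_round(path):
    """One round of open/close, pipe, dup, and a write/read through the pipe."""
    fd = os.open(path, os.O_RDONLY)
    os.close(fd)
    read_end, write_end = os.pipe()
    dup_end = os.dup(write_end)
    os.write(dup_end, b"x")
    data = os.read(read_end, 1)
    for descriptor in (read_end, write_end, dup_end):
        os.close(descriptor)
    return data


def spawn_short_process():
    """Create a short-lived child process and return its exit status."""
    return subprocess.call([sys.executable, "-c", "pass"])


def run_syscall_test(rounds, spawn_every):
    """Run `rounds` rounds, creating a child every `spawn_every` rounds; return seconds."""
    start = time.perf_counter()
    for i in range(rounds):
        syscall_round(os.devnull)
        if i % spawn_every == 0:
            spawn_short_process()
    return time.perf_counter() - start
```

Because almost no user-level computation is done, the elapsed time is dominated by the cost of entering and leaving the kernel and of process creation.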

3.4.7 Compiler and linker’s Performance Test 

A 'C' program is compiled and linked with the same optimization options to compare 
the performance of the C preprocessor, C compiler, and linker on different systems. 

3.4.8 File System Performance Check 

This suite includes the routines to check the file system's performance. The routines 
perform a number of operations such as read, write, lseek, sync, copy, etc. using the 
file system services. 

3.4.9 Disk I/O Performance Test 

It performs a series of tests on a file of known size. The default size is 50 MB. A 
large file size is chosen to overwhelm the buffer cache. The idea is to make sure that 
these are real transfers between user space and the physical disk. The tests are: 

Sequential Output 

• Per character: The file is written using the putc() stdio macro. The loop 
that does the writing is small enough to fit into any reasonable cache. The 
CPU overhead here is that required to do the stdio code plus the OS file space 
allocation. 

• Block: The file is overwritten using the write system call. The CPU overhead 
should be just the OS file space allocation. 

• Rewrite: Each block of the file is read with the read system call, dirtied, and 
rewritten with the write system call. Since no space allocation is done, and the 
I/O is well-localized, this should test the effectiveness of the filesystem cache 
and the speed of data transfer. 

Sequential Input 

• Per-Character: The file is read using the getc() stdio macro. Once again, 
the inner loop is small. This should check the performance of only stdio and 
sequential input operations. 

• Block: The file is read using the read system call. This should be a very pure test 
of sequential input performance. 

Random Seeks 

This test runs several processes in parallel, doing a total of 4000 lseeks to random 
locations in the file. In each case, the block is read with the read system call. In 10% 
of cases, it is dirtied and written back with the write system call. The idea behind the 
parallel seeking processes is to make sure that there is always a queued-up seek. 
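
The block output and random seek phases can be sketched as below. Several things here are assumptions and differ from the test described above: the block size of 8192 bytes is not taken from the source, the sketch uses a single process where the test runs several in parallel, the 10% rewrite rate is approximated with a random draw, and the demonstration file is far smaller than the 50 MB default.

```python
import os
import random
import tempfile

BLOCK = 8192  # assumed block size, not taken from the source


def block_write(path, nblocks):
    """Sequential output: write the file block by block with write()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(nblocks):
            os.write(fd, b"\0" * BLOCK)
        os.fsync(fd)              # push the data out of the buffer cache
    finally:
        os.close(fd)


def random_seeks(path, nblocks, seeks, seed=0):
    """Seek to random blocks, read each one, and rewrite roughly 10% of them."""
    rng = random.Random(seed)
    dirtied = 0
    fd = os.open(path, os.O_RDWR)
    try:
        for _ in range(seeks):
            offset = rng.randrange(nblocks) * BLOCK
            os.lseek(fd, offset, os.SEEK_SET)
            block = os.read(fd, BLOCK)
            if rng.random() < 0.10:
                os.lseek(fd, offset, os.SEEK_SET)
                os.write(fd, b"\1" + block[1:])   # dirty the block, same length
                dirtied += 1
    finally:
        os.close(fd)
    return dirtied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "testfile")
        block_write(target, 64)
        print("blocks dirtied:", random_seeks(target, 64, 400))
```

Timing each phase separately, for example with time.perf_counter(), gives per-phase figures comparable to those the test reports.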

3.4.10 NFS Performance Test 

The following operations are performed on an NFS-mounted directory [Seix93]: 

• Makedir: Constructs a subtree consisting of 15 directories. The directories are 
constructed hierarchically as well as in a flat structure. 

• Remove dir: Removes the subtree created in Makedir. 

• Copy: Copies all the files from one directory to another. 

• Move: Moves all the files from one directory to another. 

• Delete Dir: Deletes all the files from the specified directory. 

• Create files: Creates 100 files in a directory. 

• Read file: Repeatedly reads a fixed number of bytes from a file. 

• Write file: Repeatedly writes a fixed number of bytes into a file. 
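
A subset of the operations listed above can be sketched as timed directory operations. The sketch is illustrative only: the function names, file names, and file contents are hypothetical, and the base directory is assumed to lie on the NFS mount being measured (the demonstration uses a local temporary directory instead).

```python
import os
import shutil
import tempfile
import time


def timed(operation, *args):
    """Run operation(*args) and return the elapsed time in seconds."""
    start = time.perf_counter()
    operation(*args)
    return time.perf_counter() - start


def create_files(directory, count=100):
    os.makedirs(directory, exist_ok=True)
    for i in range(count):
        with open(os.path.join(directory, "f%03d" % i), "wb") as f:
            f.write(b"data")


def copy_files(src, dst):
    os.makedirs(dst, exist_ok=True)
    for name in os.listdir(src):
        shutil.copy(os.path.join(src, name), os.path.join(dst, name))


def move_files(src, dst):
    os.makedirs(dst, exist_ok=True)
    for name in os.listdir(src):
        shutil.move(os.path.join(src, name), os.path.join(dst, name))


def delete_files(directory):
    for name in os.listdir(directory):
        os.remove(os.path.join(directory, name))


def run_directory_tests(base):
    """Create 100 files in a, copy a to b, move b to c, then delete the files in c."""
    a, b, c = (os.path.join(base, d) for d in ("a", "b", "c"))
    return {
        "create": timed(create_files, a),
        "copy": timed(copy_files, a, b),
        "move": timed(move_files, b, c),
        "delete": timed(delete_files, c),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        for name, seconds in run_directory_tests(base).items():
            print("%-6s %.4f s" % (name, seconds))
```

Pointing the base directory at different mounted servers and comparing the per-operation times gives the kind of server-to-server comparison the test is meant to produce.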

3.4.11 NFS Performance Test using NFS Stone 

The popular benchmark program 'NFS Stone', discussed in section 2.3.3, is used to 
measure the number of NFSStones an NFS server can serve in a second. 

3.4.12 Network Performance Tests Using Netperf 

Netperf, a benchmark program from HP (discussed in section 2.4.2), is used to 
measure the TCP stream performance, TCP request/response performance, UDP 
stream performance, and UDP request/response performance. 

Chapter 4 

Results and Comparisons 


4.1 Introduction 

One of the important steps in every performance evaluation study is the presentation 
of final results. The ultimate aim of every performance analysis is to help in decision 
making. 

Graphic charts are used to present performance results. There are a number of 
reasons why a graphic chart may be used for data presentation in place of textual 
explanation [Jai91]. A graphic chart saves a reader's time and presents the same 
information more concisely. 

This tool runs on different machines and records the system load and test results 
over a long period of time. The tool includes a program to generate artificial load 
to measure the performance of an idle system under different load conditions. 

The tests were run on a variety of systems available at IIT Kanpur. The systems 
were: 

• agni, an HP-9000/735 machine with 144MB RAM and SCSI hard disks running 
HP-UX release A.09.03 version C. Cost of the system at the time of purchase 
(in Sep 1994) was US$ 50,607. 

• pc47, a Pentium 166 MHz machine with 32MB RAM and an IDE disk running 
Linux 2.0.0. Cost of the system at the time of purchase (in Jan 1997) was Rs. 
55,000. 

• sg2, a 180 MHz Silicon Graphics Origin 200 (IP27 CPU board) machine with 
128MB RAM and fast wide SCSI disks running IRIX64. Cost of the system 
at the time of purchase (in Feb 1997) was US$ 75,000. 

• cd3, a 233 MHz DEC Alpha 2000 dual processor machine with 128MB RAM 
and fast wide hot-swappable SCSI disks running OSF1 V3.2. Cost of the 
system at the time of purchase (in July 1996) was Rs. 1,200,000. 

• shakti, a 233 MHz DEC Alpha 2000 machine with 128MB RAM and SCSI 
disks running OSF1 V3.2. 

• cul, a 167 MHz Sun Sparc Ultra 1 machine with 128MB RAM and fast wide 
SCSI disks running SunOS 5.5.1 (Solaris version). Cost of the system at the 
time of purchase (in June 1996) was Rs. 850,000. 

For each of the above systems, the execution time of each test has been plotted 
against the total number of processes and the number of login sessions. This choice has 
been made because the execution time and the number of processes or login sessions are 
good representatives of the system performance and system load, respectively. 

4.2 Observations 

The following are some of the inferences which can be made directly from the plots. 

4.2.1 Integer Arithmetic Test 

This test includes a mix of addition, subtraction, multiplication, division, assignment, 
increment, and conditional jump instructions. This test directly measures the 
integer arithmetic performance of the CPU since there is no system call or I/O 
overhead and no support from a floating point coprocessor is required. Figure 2 shows 
the results. We gather the following information from the graph. 

• sg2 has minimum execution time for this test. 

• cd3 gives a steady performance with increasing system load. 

• shakti becomes very slow when the system load increases. 

• cul gives the same performance as cd3. 

• agni is slower than all other systems but shakti, and its performance goes down 
with increasing system load. 

• pc47's performance goes down as the load increases, but at low load it is better 
than all systems but sg2. 

Figure 2: Integer Test 

4.2.2 Floating-Point Arithmetic Test 

This test is not actually a pure floating point test but a mix of several numerical 
operations. It includes arithmetic operations on array elements, conditional jumps, 
integer arithmetic, trigonometric functions (atan, sin, cos), and standard functions 
(sqrt, exp, log). Figure 3 shows the results of the floating-point test. The following is a 
summary of observations from the graph. 

• sg2 gives the best performance. 

• cd3 is exceptionally stable with increasing system load. 

• Performance of shakti is very poor as compared with other systems. 

• agni gives the same performance as cd3 at low system load, but it becomes slower 
with increasing system load. 

• pc47 is slower than all other systems but shakti. 

• cul is as good as cd3 but slower than sg2. 

4.2.3 FFT Test 


This suite includes routines to compute the Hartley transform, Fourier transform, 
inverse Fourier transform, real-valued Fourier transform, and inverse of the 
real-valued Fourier transform. This suite is a good representative of CPU-bound 
jobs with mixed types of instructions. The graph is shown in Figure 4. We notice the 
following points from the graph. 

• sg2 is again the best system. 

• cd3 is slower than sg2 but shows stable performance at very high system load. 

• cul gives the same performance as cd3. 

• agni is average, but its performance drops at higher system load. 

• shakti is slow and becomes slower at high load. 

• pc47 gives performance similar to that of shakti. 

4.2.4 Matrix Multiplication Test 

This program performs the matrix multiplication of large size matrices. The test runs 
best on multiprocessor systems but the operation is common in several mathematical 
problems that run on single or dual processor systems. The graph is shown in 
figure 5. Following is a brief summary of observations. 

• As expected, pc47 is not a good system for this type of computation. 

• shakti is slower than all other systems but pc47. 

• sg2 is again the best. 

• agni is as good as sg2 at low system load which is surprising. 

• cd3 is again very steady in performance with increasing system load. 

• cul's performance is average. 

4.2.5 N-Queens Test 


This program finds all the possible ways for 'N' queens to be placed on an NxN 
chessboard such that they do not capture one another. The program synthesizes 
a typical scientific program that makes use of pointers, complex data structures, 
and recursion. Only integer instructions are present in this program (but not much 
integer arithmetic). Figure 6 shows the graph. The following is a brief summary of 
observations. 

• cd3 gives the best and steady performance for this test. 

• cul is as good as cd3. 

• sg2 is left behind by cul and cd3 in this test. 

• pc47’s performance varies in a large range. It becomes very slow at high load. 

• agni is average but it shows fluctuating performance at higher load. 

• shakti is again slower than all other systems, and it becomes very slow 
at higher load. 

Figure 6: N-Queen Problem Test 




4.2.6 Unix System Call Overhead Test 

This suite includes various routines to synthesize the general Unix system call workload. 
The routines use fork, exec, pipe, dup, open, close, and some other system calls. 
Some short-lived processes (such as 'ls') are created periodically. This test 
is intended to measure the system performance for normal interactive load. Figure 7 
shows the graph. A summary of the observations is presented below. 

• Surprisingly, pc47 gives the best performance in this test. This suggests that 
Linux is the best operating system as far as system call overhead is concerned. 
However, the performance goes down as the system load increases. This may be because 
the CPU of the PC is less powerful than those of the other systems. 

• cul is the second best system for this test. This was expected, as Solaris is 
considered one of the best operating systems available. 

• sg2 is slower than pc47 and cul. 

• agni is average at low system load but the performance decreases as the load 
increases. 

• cd3 is slower than all other systems but shakti. 

• shakti is the slowest and its performance goes down with increasing system 
load. We should note that both cd3 and shakti are DEC Alpha machines running 
OSF/1, in which the Unix emulation runs on top of the Mach OS. This may be the 
reason for the low performance of these systems as compared with the others in the 
lot. 

4.2.7 Compiler and linker’s Performance Test 

This test compiles and links a 'C' program with the same optimization options 
to compare the performance of the C preprocessor, C compiler, and linker on different 
systems. Figure 8 shows the graph. Observations are summarized below. 

• sg2 gives the best performance at low system load, but it slows down when the load 
increases. 

• cd3 is unstable in performance. 

• The performance of cul is the best and it is unaffected by the increasing system 
load. 

• agni is slower than cul and sg2. It is somewhat stable at low system load 
but shows fluctuating performance at higher load. 

• pc47 is not a good machine for this job. 

• shakti is slower than all other machines but pc47. 

Figure 8: C Program Compilation Test 



4.2.8 Sequential Write Test 

In this test the file is written using the putc() stdio macro. The CPU overhead here 
is that required to do the stdio code plus the OS file space allocation. Figure 9 shows 
the graph. We observe the following points from the graph. 

• sg2 gives the fastest disk write time. The time remains constant with 
increasing load on the system. 

• cd3 is very close to sg2 and its performance is steady with the increasing load. 

• cul is average and gives almost constant performance with increasing system 
load. 

• agni is slower than sg2, cd3 and cul at high load, and its performance remains 
steady with small fluctuations. 

• shakti is close to agni at low system load but it becomes very slow at high 
load. 

• pc47 is better than agni and shakti at low system load but its performance 
decreases at higher load. 

4.2.9 Fast Write Test 

The file is overwritten using the write system call. The CPU overhead should be just 
the OS file space allocation. Figure 10 presents the results as a graph. We gather 
the following observations from the figure. 

• sg2 is very fast in blockwise sequential write. 

• cul gives the same performance as sg2, but its performance was not close to 
sg2 in per-character sequential write. This means that the disk I/O time for sg2 and 
cul is almost the same, and the extra time taken by cul in per-character write may 
be because of its slower CPU (CPU overhead is large in per-character write) or 
because the file system space allocation time is higher on cul. 

• cd3 improves in block write and its performance remains steady with higher 
load. 

• agni gives a very poor performance as compared with other systems. We should 
note that the time taken by agni (around 40 seconds) in per-block write is the same 
as it took in per-character sequential write. 

• shakti also improves but its performance goes down at high load. 

• The performance of pc47 is better only than that of agni, and it deteriorates with 
increasing system load. 

4.2.10 Rewrite Test 

Each block of the file is read with the read system call, dirtied, and rewritten with 
the write system call. Since no space allocation is done, and the I/O is well-localized, 
this should test the effectiveness of the filesystem cache and the speed of data transfer. 
The result graph is shown in Figure 11. The following are a few observations from the 
graph. 

• cul gives the best performance for this test, which indicates that the file 
system of cul is very efficient. 

• sg2 is a little slower than cul. 

• agni improves a lot in this test. 

• cd3 gives same performance as agni but it shows some variation with increase 
in load. 

• shakti is also close to cd3 and agni but its performance degrades with increasing 
load. 

• pc47 is very poor as compared with other systems. Its performance degrades even 
further at higher load. 

4.2.11 Sequential Read Test 

The file is read using the getc() stdio macro. This should check the performance of 
only stdio and sequential input operations. Figure 12 presents the graph for this 
test. Observations are summarized below. 

• sg2 gives the best performance in per character sequential read too. 

• cd3 is very close to sg2 and remains unaffected by the increasing load. 

• agni is better in read operation than it was in write operation. 

• cul is surprisingly slower than agni in this test. 

• shakti is again slower than all other systems but pc47 and its performance goes 
down with increasing system load. 

• pc47 is close to agni and cul at low load but it becomes extremely slow at high 
load. 

4.2.12 Fast Read Test 

The file is read using the read system call. This should be a very pure test of sequential 
input performance. Figure 13 shows the result graph for this test. Points observed 
from the graph are listed below. 

• cul is the fastest machine for this test. 

• sg2 is very close to cul with a little fluctuation in performance. 

• agni is as good as sg2 and cul. Its performance remains steady at 
high load. 

• cd3 is not very far from cul, sg2 and agni but shows some fluctuation in 
performance. 

• shakti is slower than cul, sg2, agni and cd3 but this time the difference is not 
very large. 

• pc47 is again very poor and its performance goes down at high system load. 

Figure 13: Fast Read Test 

4.2.13 NFS Performance Tests 

NFS performance tests were executed on pc47 after mounting the exported direct- 
ories of remote systems through NFS. Five systems were compared in this test. 

• sk3 directory of shakti. 

• agl directory of agni. 

• kl directory of bhaskar (an HP 9000/819 system). 

• ul4 directory of cul. 

• g21 directory of sg2. 

The performance is shown against the number of bytes on the network in the last 
five minutes. It is observed that the network load does not affect NFS performance. 
It means that network bandwidth is not a bottleneck in NFS performance. 

NFS Test 

The NFS test includes several NFS operations such as make dir, remove dir, copy, move, 
delete, create, read, write, etc. The graph is shown in Figure 14. Some observations 
are as follows. 

• sg2 is the best NFS server available. Its performance is far better than 
that of the other servers tested. 

• cul is the second best NFS server in the lot, but there is a large gap between the 
performance of cul and sg2. 

• bhaskar, agni, and shakti give nearly the same performance; agni is a bit slower 
than the other two. 

NFSSTONE 

NFSstone produces results in NFSstones per second, which is a direct measure 
of NFS performance. We should note that a system plotted higher on the graph 
gives higher performance. Figure 14 presents the result graph for this test. 
Observations are listed below. 

• sg2 is the best NFS server as rated by the NFS test. 

• cul is better than all other systems but sg2. 

• shakti, agni, and bhaskar are almost similar in performance, but we can see that 
shakti is the best among these three, agni is second, and the slowest NFS server is 
bhaskar. 

Chapter 5 
Conclusions 


A performance evaluation tool has been designed to evaluate the performance of 
computer systems in the IITK environment. The results can help a system administrator 
to assign the appropriate systems for different types of jobs. The tool can also 
be used to make purchase decisions for new machines and to find the bottlenecks in 
case of low performance. The tool includes a program to generate artificial load so 
that it can be used to evaluate the performance of a new system at the vendor's site. 

The system load capturing routines record a number of things such as the number of 
page faults, block input and output operations, the waiting status of processes, etc. This 
information can be used to address various issues. Two such issues are the page fault 
behavior with increasing system load and the service bottleneck for waiting processes 
(such as network or disk I/O). 

5.1 Summary of the Test Results 

Several systems installed in the Computer Center and the Department of Computer 
Science and Engineering were evaluated. Chapter 4 presents the result graphs, which 
include the performance of only six systems due to plotting limitations. We observe 
that the new Silicon Graphics machines are the fastest machines available in the 
Computer Center. SG machines show the best performance for almost all the test 
suites. The DEC Alpha 2000 (dual processor) machine of the CSE department is very close 
to the SG machines in performance. It was noted that the DEC Alpha 2000 machine 
shows steady performance at very high system load. The Sun Ultra Sparc machine of 
the CSE department is also a reasonably fast machine. The HP machines and single 
processor DEC Alpha systems installed in the Computer Center are not good enough to 
be used as servers. The Pentium PC running Linux is not a good machine for scientific 
computing or for use as a server, but it is a good workstation for a single user. 

It was observed that some of the machines are heavily loaded and some machines 
are not utilized properly. SG machines are very powerful but they are not loaded 
adequately. Some more users should be given accounts on the SG machines. Since these 
machines are good for scientific computing, the users who want to run long batch jobs 
should be given logins on the SG machines. The Ultra Sparc system of the CSE department 
is also very lightly loaded. At the same time, other servers of the CSE department 
are heavily loaded. The power of this system should be utilized by running the 
server programs (e.g. NIS, DNS, mail server, etc.) on this system. 

Another observation is that the systems with low CPU and disk I/O performance 
are used as the main NFS servers. It is believed that the network bandwidth is a 
bottleneck in the NFS performance. Test results show that the NFS performance 
is not affected by the network load. This means that the network is not saturated. 
Fast machines such as SG and Sun Ultra should be used as the main NFS servers. 

5.2 Further Extensions 

Work can be done to extend the scope of this tool. The following sections present some 
of the features that can be added to the present structure of this tool. 

5.2.1 X-Server Performance Measurement 

Some test programs can be designed to generate the workload for an X-Server. Some 
additional parameters should be defined in order to measure the current load on an 
X-Server, for instance, the number of active windows, the average number of mouse 
operations per unit time, the mode (number of colors and resolution) of the X-Server, etc. 
Another issue which matters is whether X applications are running on the workstation or 
on some remote machines. 


5.2.2 WWW Server Performance Measurement 

The World Wide Web (WWW) is today's most popular and advanced information 
system. Most of the leading universities and their departments provide information 
related to admissions, courses, faculty profiles, and ongoing research on their web 
pages. Selecting a good machine as the web server and monitoring its performance has 
become important now. This tool can be extended to measure the WWW server performance 
by adding a program that generates HTTP requests and records the response 
time. The machine running the test program and the machine running the WWW server 
should be on the same LAN, as one is measuring the WWW server's performance 
and not the network bandwidth. The log of the WWW server can be used to obtain the 
current load on the server. 
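
A request generator of the kind proposed here could be sketched as follows. This is a hypothetical illustration of the extension, not part of the tool: the function names are invented, the URL is supplied by the caller, and the requests are issued one after another, so the sketch measures response time under a single client only.

```python
import time
import urllib.request


def measure_response_times(url, requests=10, timeout=5.0):
    """Issue `requests` GET requests and return each response time in seconds."""
    times = []
    for _ in range(requests):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()          # include the transfer of the body
        times.append(time.perf_counter() - start)
    return times


def summarize(times):
    """Return (minimum, mean, maximum) of a non-empty list of times."""
    return min(times), sum(times) / len(times), max(times)
```

For example, summarize(measure_response_times(url)) with the address of a server on the same LAN would give the spread of response times to be recorded against the server load taken from its log.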

5.2.3 Measuring the Performance of Multiprocessor Systems 

A central issue in understanding performance measurement on multiprocessor ar- 
chitectures involves the pairing of architectures and applications. Some scientific 
programs requiring multiprocessor architecture should be included in the test suite 
to measure the performance of multiprocessor systems. The programs should be 
carefully selected as the suitability of the code for the underlying hardware and OS 
matters a lot in multiprocessors. [JLM88] and [MAMC86] are good texts available 
on this subject. 

The test program measures not only the power of the processors but also the 
efficiency of the compiler and the operating system. 

5.2.4 More on Network Performance 

This tool does not examine the network performance in detail. The network load 
should be characterized to get the statistics of load generated by different applications 
using different protocols. Some early work in monitoring network traffic is discussed 
in [DEMK76]. A hierarchical approach is required to characterize the network load 
on different layers. For example, in a network where different network layer protocols 
are being used, the first level can be to distinguish among the network layer protocols, 
such as IP, ARP, ICMP (of the TCP/IP suite), NetBIOS, IPX (of Novell), and NetBEUI 
(of Microsoft). In the next level we can examine the IP packets for TCP or UDP. 
Further, the packets can be examined for application layer protocols, such as HTTP, 
FTP, TFTP, SMTP, NNTP, RPC, telnet, X-Server, YP, etc. 
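
The hierarchical classification described above can be sketched for Ethernet II frames as three lookups: the EtherType, the IP protocol field, and the TCP or UDP port. The sketch below is an illustration under stated assumptions: the function and table names are hypothetical, only a few protocols are listed, ICMP is identified by the IP protocol field since it is carried inside IP packets, and the application is guessed from well-known port numbers alone.

```python
import struct

ETHERTYPES = {0x0800: "IP", 0x0806: "ARP", 0x8137: "IPX"}
IP_PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP"}
WELL_KNOWN_PORTS = {21: "FTP", 23: "telnet", 25: "SMTP", 69: "TFTP",
                    80: "HTTP", 119: "NNTP"}


def classify(frame):
    """Return (network, transport, application) labels for an Ethernet II frame."""
    if len(frame) < 14:
        return ("unknown", None, None)
    (ethertype,) = struct.unpack("!H", frame[12:14])
    network = ETHERTYPES.get(ethertype, "other")
    if network != "IP" or len(frame) < 34:
        return (network, None, None)
    header_len = (frame[14] & 0x0F) * 4        # IHL field, counted in 32-bit words
    transport = IP_PROTOCOLS.get(frame[23], "other")   # protocol field of the IP header
    if transport not in ("TCP", "UDP"):
        return (network, transport, None)
    start = 14 + header_len
    if len(frame) < start + 4:
        return (network, transport, None)
    src_port, dst_port = struct.unpack("!HH", frame[start:start + 4])
    application = (WELL_KNOWN_PORTS.get(dst_port)
                   or WELL_KNOWN_PORTS.get(src_port, "other"))
    return (network, transport, application)
```

Counting packets and bytes per returned label would give the per-protocol load statistics at each level of the hierarchy.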

The fundamental issue in network performance measurement is throughput measurement 
[KO77]. However, test programs that generate the workload of 
different protocols should be used to measure the network performance in a more 
meaningful way. [HO86] presents an overview of performance analysis of LANs. 

5.2.5 Measuring the Performance of Database Systems 

Several ad-hoc benchmarks for database and transaction-processing systems exist. 
Most of the database and transaction-processing benchmarks measure the perform- 
ance in vague metrics such as transactions per second and queries processed per 
second. Each “database product vendor” has implemented the standard ad-hoc 
benchmarks so that they can measure their product’s performance against the other 
products. Work can be done to study transaction processing systems and to design 
a benchmark suite specifically for these systems. Some key issues in analyzing the 
database system architectures are discussed in [SByMST]. A comprehensive view of 
benchmarking for modern transaction processing and database systems is presented 
in [JG91]. 